Foreword


Machine learning is the latest in a long line of attempts to distill human 
knowledge and reasoning into a form that is suitable for constructing ma- 
chines and engineering automated systems. As machine learning becomes 
more ubiquitous and its software packages become easier to use, it is nat- 
ural and desirable that the low-level technical details are abstracted away 
and hidden from the practitioner. However, this brings with it the danger 
that a practitioner becomes unaware of the design decisions and, hence, 
the limits of machine learning algorithms. 

The enthusiastic practitioner who is interested in learning more about the
magic behind successful machine learning algorithms currently faces a 
daunting set of pre-requisite knowledge: 


- Programming languages and data analysis tools
- Large-scale computation and the associated frameworks
- Mathematics and statistics and how machine learning builds on them


At universities, introductory courses on machine learning tend to spend 
early parts of the course covering some of these pre-requisites. For histori- 
cal reasons, courses in machine learning tend to be taught in the computer 
science department, where students are often trained in the first two areas 
of knowledge, but not so much in mathematics and statistics. 

Current machine learning textbooks primarily focus on machine learn- 
ing algorithms and methodologies and assume that the reader is com- 
petent in mathematics and statistics. Therefore, these books only spend 
one or two chapters on background mathematics, either at the beginning 
of the book or as appendices. We have found that many people who want to
delve into the foundations of basic machine learning methods struggle
with the mathematical knowledge required to read a machine learning
textbook. Having taught undergraduate and graduate courses at universi- 
ties, we find that the gap between high school mathematics and the math- 
ematics level required to read a standard machine learning textbook is too 
big for many people. 

This book brings the mathematical foundations of basic machine learn- 
ing concepts to the fore and collects the information in a single place so 
that this skills gap is narrowed or even closed. 

“Math is linked in the popular mind with phobia and anxiety. You’d think
we’re discussing spiders.” (Strogatz, 2014, page 281)




Why Another Book on Machine Learning? 


Machine learning builds upon the language of mathematics to express 
concepts that seem intuitively obvious but that are surprisingly difficult 
to formalize. Once formalized properly, we can gain insights into the task 
we want to solve. One common complaint of students of mathematics 
around the globe is that the topics covered seem to have little relevance 
to practical problems. We believe that machine learning is an obvious and 
direct motivation for people to learn mathematics. 

This book is intended to be a guidebook to the vast mathematical lit- 
erature that forms the foundations of modern machine learning. We mo- 
tivate the need for mathematical concepts by directly pointing out their 
usefulness in the context of fundamental machine learning problems. In 
the interest of keeping the book short, many details and more advanced 
concepts have been left out. Equipped with the basic concepts presented 
here, and how they fit into the larger context of machine learning, the 
reader can find numerous resources for further study, which we provide at 
the end of the respective chapters. For readers with a mathematical back- 
ground, this book provides a brief but precisely stated glimpse of machine 
learning. In contrast to other books that focus on methods and models 
of machine learning (MacKay, 2003; Bishop, 2006; Alpaydin, 2010; Bar- 
ber, 2012; Murphy, 2012; Shalev-Shwartz and Ben-David, 2014; Rogers 
and Girolami, 2016) or programmatic aspects of machine learning (Müller
and Guido, 2016; Raschka and Mirjalili, 2017; Chollet and Allaire, 2018), 
we provide only four representative examples of machine learning algo- 
rithms. Instead, we focus on the mathematical concepts behind the models 
themselves. We hope that readers will be able to gain a deeper understand- 
ing of the basic questions in machine learning and connect practical ques- 
tions arising from the use of machine learning with fundamental choices 
in the mathematical model. 

We do not aim to write a classical machine learning book. Instead, our 
intention is to provide the mathematical background, applied to four cen- 
tral machine learning problems, to make it easier to read other machine 
learning textbooks. 


Who Is the Target Audience? 


As applications of machine learning become widespread in society, we 
believe that everybody should have some understanding of its underlying 
principles. This book is written in an academic mathematical style, which 
enables us to be precise about the concepts behind machine learning. We 
encourage readers unfamiliar with this seemingly terse style to persevere 
and to keep the goals of each topic in mind. We sprinkle comments and 
remarks throughout the text, in the hope that they provide useful guidance
with respect to the big picture. 

The book assumes that the reader has mathematical knowledge commonly
covered in high school mathematics and physics. For example, the reader
should have seen derivatives and integrals before, and geometric vectors 
in two or three dimensions. Starting from there, we generalize these con- 
cepts. Therefore, the target audience of the book includes undergraduate 
university students, evening learners and learners participating in online 
machine learning courses. 

In analogy to music, there are three types of interaction that people 
have with machine learning: 

Astute Listener. The democratization of machine learning by the provision
of open-source software, online tutorials and cloud-based tools allows
users to not worry about the specifics of pipelines. Users can focus on
extracting insights from data using off-the-shelf tools. This enables non- 
tech-savvy domain experts to benefit from machine learning. This is sim- 
ilar to listening to music; the user is able to choose and discern between 
different types of machine learning, and benefits from it. More experi- 
enced users are like music critics, asking important questions about the 
application of machine learning in society such as ethics, fairness, and pri- 
vacy of the individual. We hope that this book provides a foundation for 
thinking about the certification and risk management of machine learning 
systems, and allows them to use their domain expertise to build better 
machine learning systems. 

Experienced Artist. Skilled practitioners of machine learning can plug
and play different tools and libraries into an analysis pipeline. The
stereotypical practitioner would be a data scientist or engineer who understands
machine learning interfaces and their use cases, and is able to perform 
wonderful feats of prediction from data. This is similar to a virtuoso play- 
ing music, where highly skilled practitioners can bring existing instru- 
ments to life and bring enjoyment to their audience. Using the mathe- 
matics presented here as a primer, practitioners would be able to under- 
stand the benefits and limits of their favorite method, and to extend and 
generalize existing machine learning algorithms. We hope that this book 
provides the impetus for more rigorous and principled development of 
machine learning methods. 

Fledgling Composer. As machine learning is applied to new domains,
developers of machine learning need to develop new methods and extend 
existing algorithms. They are often researchers who need to understand 
the mathematical basis of machine learning and uncover relationships be- 
tween different tasks. This is similar to composers of music who, within 
the rules and structure of musical theory, create new and amazing pieces. 
We hope this book provides a high-level overview of other technical books 
for people who want to become composers of machine learning. There is 
a great need in society for new researchers who are able to propose and 
explore novel approaches for attacking the many challenges of learning 
from data. 
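Table of Symbols

Symbol                                              Typical meaning
$a, b, c, \alpha, \beta, \gamma$                    Scalars are lowercase
$\boldsymbol{x}, \boldsymbol{y}, \boldsymbol{z}$    Vectors are bold lowercase
$\boldsymbol{A}, \boldsymbol{B}, \boldsymbol{C}$    Matrices are bold uppercase
$\boldsymbol{x}^\top, \boldsymbol{A}^\top$          Transpose of a vector or matrix
$\boldsymbol{A}^{-1}$                               Inverse of a matrix
$\langle \boldsymbol{x}, \boldsymbol{y} \rangle$    Inner product of $\boldsymbol{x}$ and $\boldsymbol{y}$
$\boldsymbol{x}^\top \boldsymbol{y}$                Dot product of $\boldsymbol{x}$ and $\boldsymbol{y}$
$B = (\boldsymbol{b}_1, \boldsymbol{b}_2, \boldsymbol{b}_3)$   (Ordered) tuple
$\boldsymbol{B} = [\boldsymbol{b}_1, \boldsymbol{b}_2, \boldsymbol{b}_3]$   Matrix of column vectors stacked horizontally
$\mathcal{B} = \{\boldsymbol{b}_1, \boldsymbol{b}_2, \boldsymbol{b}_3\}$   Set of vectors (unordered)
$\mathbb{Z}, \mathbb{N}$                            Integers and natural numbers, respectively
$\mathbb{R}, \mathbb{C}$                            Real and complex numbers, respectively
$\mathbb{R}^n$                                      $n$-dimensional vector space of real numbers
$\forall x$                                         Universal quantifier: for all $x$
$\exists x$                                         Existential quantifier: there exists $x$
$a := b$                                            $a$ is defined as $b$
$a =: b$                                            $b$ is defined as $a$
$a \propto b$                                       $a$ is proportional to $b$, i.e., $a = \text{constant} \cdot b$
$g \circ f$                                         Function composition: “$g$ after $f$”
$\iff$                                              If and only if
$\implies$                                          Implies
$\mathcal{A}, \mathcal{C}$                          Sets
$a \in \mathcal{A}$                                 $a$ is an element of set $\mathcal{A}$
$\emptyset$                                         Empty set
$\mathcal{A} \setminus \mathcal{B}$                 $\mathcal{A}$ without $\mathcal{B}$: the set of elements in $\mathcal{A}$ but not in $\mathcal{B}$
$D$                                                 Number of dimensions; indexed by $d = 1, \ldots, D$
$N$                                                 Number of data points; indexed by $n = 1, \ldots, N$
$\boldsymbol{I}_m$                                  Identity matrix of size $m \times m$
$\boldsymbol{0}_{m,n}$                              Matrix of zeros of size $m \times n$
$\boldsymbol{1}_{m,n}$                              Matrix of ones of size $m \times n$
$\boldsymbol{e}_i$                                  Standard/canonical vector (where $i$ is the component that is 1)
$\dim$                                              Dimensionality of vector space
$\operatorname{rk}(\boldsymbol{A})$                 Rank of matrix $\boldsymbol{A}$
$\operatorname{Im}(\Phi)$                           Image of linear mapping $\Phi$
$\ker(\Phi)$                                        Kernel (null space) of a linear mapping $\Phi$
$\operatorname{span}[\boldsymbol{b}_1]$             Span (generating set) of $\boldsymbol{b}_1$
$\operatorname{tr}(\boldsymbol{A})$                 Trace of $\boldsymbol{A}$
$\det(\boldsymbol{A})$                              Determinant of $\boldsymbol{A}$
$|\cdot|$                                           Absolute value or determinant (depending on context)
$\|\cdot\|$                                         Norm; Euclidean, unless specified
$\lambda$                                           Eigenvalue or Lagrange multiplier
$E_\lambda$                                         Eigenspace corresponding to eigenvalue $\lambda$
$\boldsymbol{x} \perp \boldsymbol{y}$               Vectors $\boldsymbol{x}$ and $\boldsymbol{y}$ are orthogonal
$V$                                                 Vector space
$V^\perp$                                           Orthogonal complement of vector space $V$
$\sum_{n=1}^{N} x_n$                                Sum of the $x_n$: $x_1 + \ldots + x_N$
$\prod_{n=1}^{N} x_n$                               Product of the $x_n$: $x_1 \cdot \ldots \cdot x_N$
$\boldsymbol{\theta}$                               Parameter vector
$\frac{\partial f}{\partial x}$                     Partial derivative of $f$ with respect to $x$
$\frac{\mathrm{d}f}{\mathrm{d}x}$                   Total derivative of $f$ with respect to $x$
$\nabla$                                            Gradient
$f_* = \min_x f(x)$                                 The smallest function value of $f$
$x_* \in \arg\min_x f(x)$                           The value $x_*$ that minimizes $f$ (note: $\arg\min$ returns a set of values)
$\mathfrak{L}$                                      Lagrangian
$\mathcal{L}$                                       Negative log-likelihood
$\binom{n}{k}$                                      Binomial coefficient, $n$ choose $k$
$\mathbb{V}_X[x]$                                   Variance of $x$ with respect to the random variable $X$
$\mathbb{E}_X[x]$                                   Expectation of $x$ with respect to the random variable $X$
$\operatorname{Cov}_{X,Y}[x, y]$                    Covariance between $x$ and $y$
$X \perp\!\!\!\perp Y \mid Z$                       $X$ is conditionally independent of $Y$ given $Z$
$X \sim p$                                          Random variable $X$ is distributed according to $p$
$\mathcal{N}(\boldsymbol{\mu}, \boldsymbol{\Sigma})$   Gaussian distribution with mean $\boldsymbol{\mu}$ and covariance $\boldsymbol{\Sigma}$
$\operatorname{Ber}(\mu)$                           Bernoulli distribution with parameter $\mu$
$\operatorname{Bin}(N, \mu)$                        Binomial distribution with parameters $N, \mu$
$\operatorname{Beta}(\alpha, \beta)$                Beta distribution with parameters $\alpha, \beta$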





Table of Abbreviations and Acronyms

Acronym   Meaning
e.g.      Exempli gratia (Latin: for example)
GMM       Gaussian mixture model
i.e.      Id est (Latin: this means)
i.i.d.    Independent, identically distributed
MAP       Maximum a posteriori
MLE       Maximum likelihood estimation/estimator
ONB       Orthonormal basis
PCA       Principal component analysis
PPCA      Probabilistic principal component analysis
REF       Row-echelon form
SPD       Symmetric, positive definite
SVM       Support vector machine
Introduction and Motivation


Machine learning is about designing algorithms that automatically extract 
valuable information from data. The emphasis here is on “automatic”, i.e., 
machine learning is concerned with general-purpose methodologies that
can be applied to many datasets, while producing something that is mean- 
ingful. There are three concepts that are at the core of machine learning: 
data, a model, and learning. 


Since machine learning is inherently data driven, data is at the core 
of machine learning. The goal of machine learning is to design general- 
purpose methodologies to extract valuable patterns from data, ideally 
without much domain-specific expertise. For example, given a large corpus 
of documents (e.g., books in many libraries), machine learning methods 
can be used to automatically find relevant topics that are shared across 
documents (Hoffman et al., 2010). To achieve this goal, we design mod- 
els that are typically related to the process that generates data, similar to 
the dataset we are given. For example, in a regression setting, the model 
would describe a function that maps inputs to real-valued outputs. To 
paraphrase Mitchell (1997): A model is said to learn from data if its per- 
formance on a given task improves after the data is taken into account. 
The goal is to find good models that generalize well to yet unseen data, 
which we may care about in the future. Learning can be understood as a 
way to automatically find patterns and structure in data by optimizing the 
parameters of the model. 


While machine learning has seen many success stories, and software is 
readily available to design and train rich and flexible machine learning 
systems, we believe that the mathematical foundations of machine learn- 
ing are important in order to understand fundamental principles upon 
which more complicated machine learning systems are built. Understand- 
ing these principles can facilitate creating new machine learning solutions, 
understanding and debugging existing approaches, and learning about the 
inherent assumptions and limitations of the methodologies we are work- 
ing with. 


1.1 Finding Words for Intuitions 


A challenge we face regularly in machine learning is that concepts and 
words are slippery, and a particular component of the machine learning 
system can be abstracted to different mathematical concepts. For example, 
the word “algorithm” is used in at least two different senses in the con- 
text of machine learning. In the first sense, we use the phrase “machine 
learning algorithm” to mean a system that makes predictions based on in- 
put data. We refer to these algorithms as predictors. In the second sense, 
we use the exact same phrase “machine learning algorithm” to mean a 
system that adapts some internal parameters of the predictor so that it 
performs well on future unseen input data. Here we refer to this adapta- 
tion as training a system. 

This book will not resolve the issue of ambiguity, but we want to high- 
light upfront that, depending on the context, the same expressions can 
mean different things. However, we attempt to make the context suffi- 
ciently clear to reduce the level of ambiguity. 

The first part of this book introduces the mathematical concepts and 
foundations needed to talk about the three main components of a machine 
learning system: data, models, and learning. We will briefly outline these 
components here, and we will revisit them again in Chapter 8 once we 
have discussed the necessary mathematical concepts. 

While not all data is numerical, it is often useful to consider data in 
a number format. In this book, we assume that data has already been 
appropriately converted into a numerical representation suitable for read- 
ing into a computer program. Therefore, we think of data as vectors. As 
another illustration of how subtle words are, there are (at least) three 
different ways to think about vectors: a vector as an array of numbers (a 
computer science view), a vector as an arrow with a direction and magni- 
tude (a physics view), and a vector as an object that obeys addition and 
scaling (a mathematical view). 

A model is typically used to describe a process for generating data, sim- 
ilar to the dataset at hand. Therefore, good models can also be thought 
of as simplified versions of the real (unknown) data-generating process, 
capturing aspects that are relevant for modeling the data and extracting 
hidden patterns from it. A good model can then be used to predict what 
would happen in the real world without performing real-world experi- 
ments. 

We now come to the crux of the matter, the learning component of 
machine learning. Assume we are given a dataset and a suitable model. 
Training the model means to use the data available to optimize some pa- 
rameters of the model with respect to a utility function that evaluates how 
well the model predicts the training data. Most training methods can be 
thought of as an approach analogous to climbing a hill to reach its peak. 
In this analogy, the peak of the hill corresponds to a maximum of some 
desired performance measure. However, in practice, we want the model
to perform well on unseen data. Performing well on data that
we have already seen (training data) may only mean that we found a 
good way to memorize the data. However, this may not generalize well to 
unseen data, and, in practical applications, we often need to expose our 
machine learning system to situations that it has not encountered before. 
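
The hill-climbing picture can be made concrete with a small sketch. The following is only an illustration under stated assumptions: the one-parameter predictor, the toy data, the step size, and all function names are hypothetical and not taken from the book. The utility is the negative mean squared error, and training repeatedly takes a small step uphill along its derivative; the learned parameter is then evaluated on points that were not used for training.

```python
# Illustrative sketch: fit y ~ theta * x by "climbing the hill" of a utility
# function (negative mean squared error), then check how the predictor does
# on data that was not used for training. Data and names are hypothetical.

def utility(theta, data):
    """Negative mean squared error of the predictor x -> theta * x."""
    return -sum((y - theta * x) ** 2 for x, y in data) / len(data)


def utility_gradient(theta, data):
    """Derivative of the utility with respect to theta."""
    return sum(2 * x * (y - theta * x) for x, y in data) / len(data)


def train(data, theta=0.0, step_size=0.01, num_steps=500):
    """Gradient ascent: repeatedly take a small step uphill."""
    for _ in range(num_steps):
        theta += step_size * utility_gradient(theta, data)
    return theta


# Toy data roughly following y = 2x, split into training and unseen points.
train_data = [(1.0, 2.1), (2.0, 3.9), (3.0, 6.2), (4.0, 7.8)]
test_data = [(5.0, 10.1), (6.0, 11.8)]

theta = train(train_data)
print(f"learned theta: {theta:.3f}")
print(f"utility on training data: {utility(theta, train_data):.4f}")
print(f"utility on unseen data:   {utility(theta, test_data):.4f}")
```

Comparing the two printed utilities is the simplest form of the check described above: a predictor that only memorized the training data would score well on the first and poorly on the second.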

Let us summarize the main concepts of machine learning that we cover 
in this book: 


- We represent data as vectors.

- We choose an appropriate model, either using the probabilistic or
  optimization view.

- We learn from available data by using numerical optimization methods
  with the aim that the model performs well on data not used for training.


1.2 Two Ways to Read This Book 


We can consider two strategies for understanding the mathematics for 
machine learning: 


- Bottom-up: Building up the concepts from foundational to more
  advanced. This is often the preferred approach in more technical fields,
  such as mathematics. This strategy has the advantage that the reader
  at all times is able to rely on their previously learned concepts.
  Unfortunately, for a practitioner many of the foundational concepts are not
  particularly interesting by themselves, and the lack of motivation means
  that most foundational definitions are quickly forgotten.

- Top-down: Drilling down from practical needs to more basic
  requirements. This goal-driven approach has the advantage that the readers
  know at all times why they need to work on a particular concept, and
  there is a clear path of required knowledge. The downside of this
  strategy is that the knowledge is built on potentially shaky foundations, and
  the readers have to remember a set of words that they do not have any
  way of understanding.


We decided to write this book in a modular way to separate foundational 
(mathematical) concepts from applications so that this book can be read 
in both ways. The book is split into two parts, where Part I lays the math- 
ematical foundations and Part II applies the concepts from Part I to a set 
of fundamental machine learning problems, which form four pillars of 
machine learning as illustrated in Figure 1.1: regression, dimensionality 
reduction, density estimation, and classification. Chapters in Part I mostly 
build upon the previous ones, but it is possible to skip a chapter and work 
backward if necessary. Chapters in Part II are only loosely coupled and 
can be read in any order. There are many pointers forward and backward
between the two parts of the book to link mathematical concepts with
machine learning algorithms.

Figure 1.1 The foundations and four pillars of machine learning.
(Labels recoverable from the figure: Machine Learning; Vector Calculus,
Probability & Distributions, Optimization; Linear Algebra, Analytic Geometry.)

Of course there are more than two ways to read this book. Most readers 
learn using a combination of top-down and bottom-up approaches, some- 
times building up basic mathematical skills before attempting more com- 
plex concepts, but also choosing topics based on applications of machine 
learning. 


Part I Is about Mathematics 


The four pillars of machine learning we cover in this book (see Figure 1.1) 
require a solid mathematical foundation, which is laid out in Part I. 

We represent numerical data as vectors and represent a table of such 
data as a matrix. The study of vectors and matrices is called linear algebra, 
which we introduce in Chapter 2. The collection of vectors as a matrix is 
also described there. 

Given two vectors representing two objects in the real world, we want 
to make statements about their similarity. The idea is that vectors that 
are similar should be predicted to have similar outputs by our machine 
learning algorithm (our predictor). To formalize the idea of similarity be- 
tween vectors, we need to introduce operations that take two vectors as 
input and return a numerical value representing their similarity. The con- 
struction of similarity and distances is central to analytic geometry and is 
discussed in Chapter 3. 
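
As a rough illustration of "operations that take two vectors as input and return a numerical value", the sketch below computes three common quantities for vectors stored as Python lists: the dot product, the Euclidean distance, and the cosine of the angle between the vectors. The function names and example vectors are assumptions made for this illustration, and these are only some of the possible choices of similarity and distance.

```python
import math


def dot(x, y):
    """Dot product: maps two vectors of equal length to a single number."""
    return sum(xi * yi for xi, yi in zip(x, y))


def euclidean_distance(x, y):
    """Small distance means the two vectors are close to each other."""
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))


def cosine_similarity(x, y):
    """Cosine of the angle between x and y (1: same direction, 0: orthogonal)."""
    return dot(x, y) / (math.sqrt(dot(x, x)) * math.sqrt(dot(y, y)))


# Hypothetical data points: b is a slightly perturbed copy of a,
# while c points in a direction orthogonal to a.
a = [1.0, 2.0]
b = [1.1, 2.1]
c = [-2.0, 1.0]

print(euclidean_distance(a, b), euclidean_distance(a, c))
print(cosine_similarity(a, b), cosine_similarity(a, c))
```

Under either measure, `a` and `b` come out as similar and `a` and `c` as dissimilar, which is the kind of statement a predictor can build on.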

In Chapter 4, we introduce some fundamental concepts about matri- 
ces and matrix decomposition. Some operations on matrices are extremely 
useful in machine learning, and they allow for an intuitive interpretation 
of the data and more efficient learning. 

We often consider data to be noisy observations of some true underly- 
ing signal. We hope that by applying machine learning we can identify the 
signal from the noise. This requires us to have a language for quantify- 
ing what “noise” means. We often would also like to have predictors that 
allow us to express some sort of uncertainty, e.g., to quantify the
confidence we have about the value of the prediction at a particular test data
point. Quantification of uncertainty is the realm of probability theory and 
is covered in Chapter 6. 

To train machine learning models, we typically find parameters that 
maximize some performance measure. Many optimization techniques re- 
quire the concept of a gradient, which tells us the direction in which to 
search for a solution. Chapter 5 is about vector calculus and details the 
concept of gradients, which we subsequently use in Chapter 7, where we 
talk about optimization to find maxima/minima of functions. 


Part II Is about Machine Learning 


The second part of the book introduces four pillars of machine learning 
as shown in Figure 1.1. We illustrate how the mathematical concepts in- 
troduced in the first part of the book are the foundation for each pillar. 
Broadly speaking, chapters are ordered by difficulty (in ascending order). 

In Chapter 8, we restate the three components of machine learning 
(data, models, and parameter estimation) in a mathematical fashion. In 
addition, we provide some guidelines for building experimental set-ups 
that guard against overly optimistic evaluations of machine learning sys- 
tems. Recall that the goal is to build a predictor that performs well on 
unseen data. 

In Chapter 9, we will have a close look at linear regression, where our 
objective is to find functions that map inputs $\boldsymbol{x} \in \mathbb{R}^D$ to corresponding
observed function values $y \in \mathbb{R}$, which we can interpret as the labels of their
respective inputs. We will discuss classical model fitting (parameter esti- 
mation) via maximum likelihood and maximum a posteriori estimation, 
as well as Bayesian linear regression, where we integrate the parameters 
out instead of optimizing them. 

Chapter 10 focuses on dimensionality reduction, the second pillar in Fig- 
ure 1.1, using principal component analysis. The key objective of dimen- 
sionality reduction is to find a compact, lower-dimensional representation 
of high-dimensional data $\boldsymbol{x} \in \mathbb{R}^D$, which is often easier to analyze than
the original data. Unlike regression, dimensionality reduction is only
concerned with modeling the data; there are no labels associated with a
data point $\boldsymbol{x}$.

In Chapter 11, we will move to our third pillar: density estimation. The 
objective of density estimation is to find a probability distribution that de- 
scribes a given dataset. We will focus on Gaussian mixture models for this 
purpose, and we will discuss an iterative scheme to find the parameters of 
this model. As in dimensionality reduction, there are no labels associated 
with the data points $\boldsymbol{x} \in \mathbb{R}^D$. However, we do not seek a low-dimensional
representation of the data. Instead, we are interested in a density model 
that describes the data. 

Chapter 12 concludes the book with an in-depth discussion of the fourth 
pillar: classification. We will discuss classification in the context of support
vector machines. Similar to regression (Chapter 9), we have inputs $\boldsymbol{x}$ and
corresponding labels y. However, unlike regression, where the labels were 
real-valued, the labels in classification are integers, which requires special 
care. 


1.3 Exercises and Feedback 


We provide some exercises in Part I, which can be done mostly by pen and 
paper. For Part II, we provide programming tutorials (Jupyter notebooks)
to explore some properties of the machine learning algorithms we discuss 
in this book. 

We appreciate that Cambridge University Press strongly supports our 
aim to democratize education and learning by making this book freely 
available for download at 


https://mml-book.com


where tutorials, errata, and additional materials can be found. Mistakes 
can be reported and feedback provided using the preceding URL. 
Linear Algebra


When formalizing intuitive concepts, a common approach is to construct a 
set of objects (symbols) and a set of rules to manipulate these objects. This 
is known as an algebra. Linear algebra is the study of vectors and certain 
rules to manipulate vectors. The vectors many of us know from school are 
called “geometric vectors”, which are usually denoted by a small arrow 
above the letter, e.g., $\vec{x}$ and $\vec{y}$. In this book, we discuss more general
concepts of vectors and use a bold letter to represent them, e.g., $\boldsymbol{x}$ and $\boldsymbol{y}$.

In general, vectors are special objects that can be added together and 
multiplied by scalars to produce another object of the same kind. From 
an abstract mathematical viewpoint, any object that satisfies these two 
properties can be considered a vector. Here are some examples of such 
vector objects: 


1. Geometric vectors. This example of a vector may be familiar from high 
school mathematics and physics. Geometric vectors (see Figure 2.1(a))
are directed segments, which can be drawn (at least in two dimensions).
Two geometric vectors $\vec{x}, \vec{y}$ can be added, such that $\vec{x} + \vec{y} = \vec{z}$
is another geometric vector. Furthermore, multiplication by a scalar
$\lambda\vec{x}$, $\lambda \in \mathbb{R}$, is also a geometric vector. In fact, it is the original vector
scaled by $\lambda$. Therefore, geometric vectors are instances of the vector
concepts introduced previously. Interpreting vectors as geometric vec- 
tors enables us to use our intuitions about direction and magnitude to 
reason about mathematical operations. 

2. Polynomials are also vectors; see Figure 2.1(b): Two polynomials can
be added together, which results in another polynomial; and they can
be multiplied by a scalar $\lambda \in \mathbb{R}$, and the result is a polynomial as
well. Therefore, polynomials are (rather unusual) instances of vectors.
Note that polynomials are very different from geometric vectors. While
geometric vectors are concrete “drawings”, polynomials are abstract
concepts. However, they are both vectors in the sense previously
described.

3. Audio signals are vectors. Audio signals are represented as a series of
numbers. We can add audio signals together, and their sum is a new
audio signal. If we scale an audio signal, we also obtain an audio signal.
Therefore, audio signals are a type of vector, too.

4. Elements of $\mathbb{R}^n$ (tuples of $n$ real numbers) are vectors. $\mathbb{R}^n$ is more
abstract than polynomials, and it is the concept we focus on in this
book. For instance,

$$\boldsymbol{a} = \begin{bmatrix} 1 \\ 2 \\ 3 \end{bmatrix} \in \mathbb{R}^3 \tag{2.1}$$

is an example of a triplet of numbers. Adding two vectors $\boldsymbol{a}, \boldsymbol{b} \in \mathbb{R}^n$
component-wise results in another vector: $\boldsymbol{a} + \boldsymbol{b} = \boldsymbol{c} \in \mathbb{R}^n$. Moreover,
multiplying $\boldsymbol{a} \in \mathbb{R}^n$ by $\lambda \in \mathbb{R}$ results in a scaled vector $\lambda\boldsymbol{a} \in \mathbb{R}^n$.
Considering vectors as elements of $\mathbb{R}^n$ has an additional benefit that
it loosely corresponds to arrays of real numbers on a computer. Many
programming languages support array operations, which allow for
convenient implementation of algorithms that involve vector operations.

Figure 2.1 Different types of vectors. Vectors can be surprising objects,
including (a) geometric vectors and (b) polynomials.

Pavel Grinfeld’s series on linear algebra: http://tinyurl.com/nahclwm
Gilbert Strang’s course on linear algebra: http://tinyurl.com/29p5q8j
3Blue1Brown series on linear algebra: https://tinyurl.com/h5g4kps

Be careful to check whether array operations actually perform vector
operations when implementing on a computer.
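
Whether an array operation really is a vector operation depends on the language and data type. The sketch below shows one such pitfall in plain Python, where `+` and `*` on built-in lists concatenate and repeat rather than add and scale. The helper names `vec_add` and `vec_scale` are hypothetical and introduced only for this illustration; array libraries such as NumPy define these operators element-wise instead.

```python
a = [1.0, 2.0, 3.0]
b = [4.0, 5.0, 6.0]

# On built-in lists, these operators are NOT vector operations:
concatenated = a + b   # list concatenation: a list of six numbers
repeated = 2 * a       # list repetition: a followed by a again


def vec_add(x, y):
    """Component-wise addition of two vectors of equal length."""
    if len(x) != len(y):
        raise ValueError("vectors must have the same number of components")
    return [xi + yi for xi, yi in zip(x, y)]


def vec_scale(lam, x):
    """Multiplication of the vector x by the scalar lam."""
    return [lam * xi for xi in x]


print(concatenated)        # [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
print(repeated)            # [1.0, 2.0, 3.0, 1.0, 2.0, 3.0]
print(vec_add(a, b))       # [5.0, 7.0, 9.0]
print(vec_scale(2.0, a))   # [2.0, 4.0, 6.0]
```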


Linear algebra focuses on the similarities between these vector concepts. 
We can add them together and multiply them by scalars. We will largely 
focus on vectors in $\mathbb{R}^n$ since most algorithms in linear algebra are
formulated in $\mathbb{R}^n$. We will see in Chapter 8 that we often consider data to
be represented as vectors in $\mathbb{R}^n$. In this book, we will focus on
finite-dimensional vector spaces, in which case there is a 1:1 correspondence
between any kind of vector and $\mathbb{R}^n$. When it is convenient, we will use
intuitions about geometric vectors and consider array-based algorithms. 
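
One way to see this correspondence is to store a polynomial by its coefficients. In the sketch below, which assumes a representation chosen purely for illustration (a list `[c0, c1, c2, ...]` standing for $c_0 + c_1 t + c_2 t^2 + \ldots$), adding polynomials and scaling them becomes component-wise addition and scaling of coefficient arrays. All function names are hypothetical.

```python
def poly_add(p, q):
    """Add two polynomials given as coefficient lists [c0, c1, c2, ...]."""
    n = max(len(p), len(q))
    p = p + [0.0] * (n - len(p))   # pad the shorter list with zeros
    q = q + [0.0] * (n - len(q))
    return [pi + qi for pi, qi in zip(p, q)]


def poly_scale(lam, p):
    """Multiply a polynomial by the scalar lam."""
    return [lam * c for c in p]


def poly_eval(p, t):
    """Evaluate the polynomial with coefficients p at the point t."""
    return sum(c * t ** k for k, c in enumerate(p))


p = [1.0, 0.0, 2.0]   # 1 + 2 t^2
q = [0.0, 3.0]        # 3 t

s = poly_add(p, q)    # coefficients of 1 + 3 t + 2 t^2
print(s)

# Adding coefficient arrays agrees with adding the function values.
t = 2.0
print(poly_eval(s, t), poly_eval(p, t) + poly_eval(q, t))
```

Here polynomials of degree at most 2 are identified with elements of $\mathbb{R}^3$, so array-based code written for $\mathbb{R}^3$ applies to them unchanged.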

One major idea in mathematics is the idea of “closure”. This is the ques- 
tion: What is the set of all things that can result from my proposed oper- 
ations? In the case of vectors: What is the set of vectors that can result by 
starting with a small set of vectors, and adding them to each other and 
scaling them? This results in a vector space (Section 2.4). The concept of 
a vector space and its properties underlie much of machine learning. The 
concepts introduced in this chapter are summarized in Figure 2.2. 

This chapter is mostly based on the lecture notes and books by Drumm 
and Weil (2001), Strang (2003), Hogben (2013), Liesen and Mehrmann 
(2015), as well as Pavel Grinfeld’s Linear Algebra series. Other excellent 
resources are Gilbert Strang’s Linear Algebra course at MIT and the Linear
Algebra Series by 3Blue1Brown. 

Linear algebra plays an important role in machine learning and gen- 
eral mathematics. The concepts introduced in this chapter are further ex- 
panded to include the idea of geometry in Chapter 3. In Chapter 5, we 
will discuss vector calculus, where a principled knowledge of matrix op- 
erations is essential. In Chapter 10, we will use projections (to be intro- 
duced in Section 3.8) for dimensionality reduction with principal compo- 
nent analysis (PCA). In Chapter 9, we will discuss linear regression, where 
linear algebra plays a central role for solving least-squares problems. 


2.1 Systems of Linear Equations 


Systems of linear equations play a central part in linear algebra. Many
problems can be formulated as systems of linear equations, and linear 
algebra gives us the tools for solving them. 


Example 2.1 

A company produces products $N_1, \ldots, N_n$ for which resources
$R_1, \ldots, R_m$ are required. To produce a unit of product $N_j$, $a_{ij}$ units of
resource $R_i$ are needed, where $i = 1, \ldots, m$ and $j = 1, \ldots, n$.

The objective is to find an optimal production plan, i.e., a plan of how
many units $x_j$ of product $N_j$ should be produced if a total of $b_i$ units of
resource $R_i$ are available and (ideally) no resources are left over.

If we produce $x_1, \ldots, x_n$ units of the corresponding products, we need
a total of

$$a_{i1}x_1 + \cdots + a_{in}x_n \tag{2.2}$$

many units of resource $R_i$. An optimal production plan $(x_1, \ldots, x_n) \in \mathbb{R}^n$,
therefore, has to satisfy the following system of equations:

$$
\begin{aligned}
a_{11}x_1 + \cdots + a_{1n}x_n &= b_1\\
&\;\;\vdots\\
a_{m1}x_1 + \cdots + a_{mn}x_n &= b_m
\end{aligned}
\tag{2.3}
$$

where $a_{ij} \in \mathbb{R}$ and $b_i \in \mathbb{R}$.

[Figure 2.2: A mind map of the concepts introduced in this chapter and where they are used in other parts of the book.]

Equation (2.3) is the general form of a system of linear equations, and
$x_1, \ldots, x_n$ are the unknowns of this system. Every $n$-tuple $(x_1, \ldots, x_n) \in \mathbb{R}^n$
that satisfies (2.3) is a solution of the linear equation system.


Example 2.2 
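The system of linear equations

$$
\begin{aligned}
x_1 + x_2 + x_3 &= 3 \qquad (1)\\
x_1 - x_2 + 2x_3 &= 2 \qquad (2)\\
2x_1 + 3x_3 &= 1 \qquad (3)
\end{aligned}
\tag{2.4}
$$

has no solution: Adding the first two equations yields $2x_1 + 3x_3 = 5$, which
contradicts the third equation (3).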
Let us have a look at the system of linear equations

$$
\begin{aligned}
x_1 + x_2 + x_3 &= 3 \qquad (1)\\
x_1 - x_2 + 2x_3 &= 2 \qquad (2)\\
x_2 + x_3 &= 2 \qquad (3)
\end{aligned}
\tag{2.5}
$$

From the first and third equation, it follows that $x_1 = 1$. From (1)+(2),
we get $2x_1 + 3x_3 = 5$, i.e., $x_3 = 1$. From (3), we then get that $x_2 = 1$.
Therefore, $(1, 1, 1)$ is the only possible and unique solution (verify that
$(1, 1, 1)$ is a solution by plugging in).

As a third example, we consider

$$
\begin{aligned}
x_1 + x_2 + x_3 &= 3 \qquad (1)\\
x_1 - x_2 + 2x_3 &= 2 \qquad (2)\\
2x_1 + 3x_3 &= 5 \qquad (3)
\end{aligned}
\tag{2.6}
$$

Since (1)+(2)=(3), we can omit the third equation (redundancy). From
(1) and (2), we get $2x_1 = 5 - 3x_3$ and $2x_2 = 1 + x_3$. We define $x_3 = a \in \mathbb{R}$
as a free variable, such that any triplet

$$\left(\tfrac{5}{2} - \tfrac{3}{2}a,\; \tfrac{1}{2} + \tfrac{1}{2}a,\; a\right), \quad a \in \mathbb{R} \tag{2.7}$$

is a solution of the system of linear equations, i.e., we obtain a solution
set that contains infinitely many solutions.


In general, for a real-valued system of linear equations we obtain either 
no, exactly one, or infinitely many solutions. Linear regression (Chapter 9) 
solves a version of Example 2.1 when we cannot solve the system of linear 
equations. 
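
The two solvable cases of Example 2.2 can be checked numerically. The following is a minimal sketch in plain Python; the names `satisfies`, `unique_system`, `underdetermined`, and `family` are illustrative and not part of the text. It plugs candidate solutions into systems (2.5) and (2.6), using exact fractions to avoid rounding.

```python
from fractions import Fraction


def satisfies(system, x):
    """Return True if x satisfies every equation (coefficients, rhs) in system."""
    return all(sum(c * xi for c, xi in zip(coeffs, x)) == rhs
               for coeffs, rhs in system)


# Systems (2.5) and (2.6), each equation stored as (coefficients, right-hand side).
unique_system = [((1, 1, 1), 3), ((1, -1, 2), 2), ((0, 1, 1), 2)]
underdetermined = [((1, 1, 1), 3), ((1, -1, 2), 2), ((2, 0, 3), 5)]


def family(a):
    """Member of the solution family (2.7) for the free variable x3 = a."""
    a = Fraction(a)
    return (Fraction(5, 2) - Fraction(3, 2) * a,
            Fraction(1, 2) + Fraction(1, 2) * a,
            a)


print(satisfies(unique_system, (1, 1, 1)))
print(all(satisfies(underdetermined, family(a)) for a in range(-3, 4)))
```

Every value of the free variable yields a solution of (2.6), whereas (2.5) accepts only the single point $(1, 1, 1)$.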


Remark (Geometric Interpretation of Systems of Linear Equations). In a
system of linear equations with two variables $x_1, x_2$, each linear equation
defines a line on the $x_1x_2$-plane. Since a solution to a system of linear
equations must satisfy all equations simultaneously, the solution set is the
intersection of these lines. This intersection set can be a line (if the linear
equations describe the same line), a point, or empty (when the lines are
parallel). An illustration is given in Figure 2.3 for the system

$$
\begin{aligned}
4x_1 + 4x_2 &= 5\\
2x_1 - 4x_2 &= 1
\end{aligned}
\tag{2.8}
$$

where the solution space is the point $(x_1, x_2) = (1, \tfrac{1}{4})$. Similarly, for three
variables, each linear equation determines a plane in three-dimensional
space. When we intersect these planes, i.e., satisfy all linear equations at
the same time, we can obtain a solution set that is a plane, a line, a point
or empty (when the planes have no common intersection). ♢
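
The three cases for two lines can be sketched in code. This is a minimal illustration under the assumption that each equation is given as a triple `(a, b, c)` meaning $a x_1 + b x_2 = c$; the function name `intersect_lines` is hypothetical. A unique intersection point is computed with Cramer's rule, a technique not used in the text but equivalent for this small case.

```python
from fractions import Fraction


def intersect_lines(eq1, eq2):
    """Classify the solution set of two equations a*x1 + b*x2 = c.

    Returns ("point", (x1, x2)), ("line", None), or ("empty", None).
    Assumes each equation has at least one nonzero coefficient.
    """
    a1, b1, c1 = map(Fraction, eq1)
    a2, b2, c2 = map(Fraction, eq2)
    det = a1 * b2 - a2 * b1
    if det != 0:
        # The lines cross in exactly one point (Cramer's rule).
        return "point", ((c1 * b2 - c2 * b1) / det, (a1 * c2 - a2 * c1) / det)
    # Same direction: identical lines if the full equations are proportional.
    if a1 * c2 == a2 * c1 and b1 * c2 == b2 * c1:
        return "line", None
    return "empty", None


print(intersect_lines((4, 4, 5), (2, -4, 1)))
```

For system (2.8) this reports a single point with coordinates $1$ and $\tfrac{1}{4}$, in agreement with the remark.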


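For a systematic approach to solving systems of linear equations, we
will introduce a useful compact notation. We collect the coefficients $a_{ij}$
into vectors and collect the vectors into matrices. In other words, we write
the system from (2.3) in the following form:

$$
\begin{bmatrix} a_{11}\\ \vdots\\ a_{m1}\end{bmatrix} x_1
+ \begin{bmatrix} a_{12}\\ \vdots\\ a_{m2}\end{bmatrix} x_2
+ \cdots
+ \begin{bmatrix} a_{1n}\\ \vdots\\ a_{mn}\end{bmatrix} x_n
= \begin{bmatrix} b_1\\ \vdots\\ b_m\end{bmatrix}
\tag{2.9}
$$

$$
\iff
\begin{bmatrix} a_{11} & \cdots & a_{1n}\\ \vdots & & \vdots\\ a_{m1} & \cdots & a_{mn}\end{bmatrix}
\begin{bmatrix} x_1\\ \vdots\\ x_n\end{bmatrix}
= \begin{bmatrix} b_1\\ \vdots\\ b_m\end{bmatrix}
\tag{2.10}
$$

[Figure 2.3: The solution space of a system of two linear equations with two variables, interpreted geometrically as the intersection of two lines. Every linear equation represents a line.]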


In the following, we will have a close look at these matrices and de- 
fine computation rules. We will return to solving linear equations in Sec- 
tion 2.3. 


2.2 Matrices 


Matrices play a central role in linear algebra. They can be used to com- 
pactly represent systems of linear equations, but they also represent linear 
functions (linear mappings) as we will see later in Section 2.7. Before we 
discuss some of these interesting topics, let us first define what a matrix 
is and what kind of operations we can do with matrices. We will see more 
properties of matrices in Chapter 4. 


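Definition 2.1 (Matrix). With $m, n \in \mathbb{N}$, a real-valued $(m, n)$ matrix $A$ is
an $m \cdot n$-tuple of elements $a_{ij}$, $i = 1, \ldots, m$, $j = 1, \ldots, n$, which is ordered
according to a rectangular scheme consisting of $m$ rows and $n$ columns:

$$
A = \begin{bmatrix}
a_{11} & a_{12} & \cdots & a_{1n}\\
a_{21} & a_{22} & \cdots & a_{2n}\\
\vdots & \vdots & & \vdots\\
a_{m1} & a_{m2} & \cdots & a_{mn}
\end{bmatrix}, \quad a_{ij} \in \mathbb{R}. \tag{2.11}
$$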


By convention (1, n)-matrices are called rows and (m, 1)-matrices are called 
columns. These special matrices are also called row/column vectors. 


$\mathbb{R}^{m\times n}$ is the set of all real-valued $(m, n)$-matrices. $A \in \mathbb{R}^{m\times n}$ can be
equivalently represented as $a \in \mathbb{R}^{mn}$ by stacking all $n$ columns of the
matrix into a long vector; see Figure 2.4.


2.2.1 Matrix Addition and Multiplication 


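The sum of two matrices $A \in \mathbb{R}^{m\times n}$, $B \in \mathbb{R}^{m\times n}$ is defined as the
element-wise sum, i.e.,

$$
A + B := \begin{bmatrix}
a_{11} + b_{11} & \cdots & a_{1n} + b_{1n}\\
\vdots & & \vdots\\
a_{m1} + b_{m1} & \cdots & a_{mn} + b_{mn}
\end{bmatrix} \in \mathbb{R}^{m\times n}. \tag{2.12}
$$

For matrices $A \in \mathbb{R}^{m\times n}$, $B \in \mathbb{R}^{n\times k}$, the elements $c_{ij}$ of the product
$C = AB \in \mathbb{R}^{m\times k}$ are computed as

$$c_{ij} = \sum_{l=1}^{n} a_{il} b_{lj}, \qquad i = 1, \ldots, m, \quad j = 1, \ldots, k. \tag{2.13}$$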


Draft (2021-01-14) of “Mathematics for Machine Learning”. Feedback: https: //mml-book.com. 




This means, to compute element $c_{ij}$ we multiply the elements of the $i$th
row of $A$ with the $j$th column of $B$ and sum them up. Later in Section 3.2,
we will call this the dot product of the corresponding row and column. In
cases where we need to be explicit that we are performing multiplication,
we use the notation $A \cdot B$ to denote multiplication (explicitly showing
“$\cdot$”).

Remark. Matrices can only be multiplied if their “neighboring” dimensions
match. For instance, an $n \times k$-matrix $A$ can be multiplied with a $k \times m$-
matrix $B$, but only from the left side:

$$\underbrace{A}_{n\times k}\;\underbrace{B}_{k\times m} = \underbrace{C}_{n\times m} \tag{2.14}$$

The product $BA$ is not defined if $m \neq n$ since the neighboring dimensions
do not match. ♢


Remark. Matrix multiplication is not defined as an element-wise operation
on matrix elements, i.e., $c_{ij} \neq a_{ij}b_{ij}$ (even if the size of $A$, $B$ was
chosen appropriately). This kind of element-wise multiplication often appears
in programming languages when we multiply (multi-dimensional) arrays
with each other, and is called a Hadamard product. ♢

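Example 2.3

For $A = \begin{bmatrix} 1 & 2 & 3\\ 3 & 2 & 1\end{bmatrix} \in \mathbb{R}^{2\times 3}$, $B = \begin{bmatrix} 0 & 2\\ 1 & -1\\ 0 & 1\end{bmatrix} \in \mathbb{R}^{3\times 2}$, we obtain

$$
AB = \begin{bmatrix} 1 & 2 & 3\\ 3 & 2 & 1\end{bmatrix}
\begin{bmatrix} 0 & 2\\ 1 & -1\\ 0 & 1\end{bmatrix}
= \begin{bmatrix} 2 & 3\\ 2 & 5\end{bmatrix} \in \mathbb{R}^{2\times 2}, \tag{2.15}
$$

$$
BA = \begin{bmatrix} 0 & 2\\ 1 & -1\\ 0 & 1\end{bmatrix}
\begin{bmatrix} 1 & 2 & 3\\ 3 & 2 & 1\end{bmatrix}
= \begin{bmatrix} 6 & 4 & 2\\ -2 & 0 & 2\\ 3 & 2 & 1\end{bmatrix} \in \mathbb{R}^{3\times 3}. \tag{2.16}
$$

From this example, we can already see that matrix multiplication is not
commutative, i.e., $AB \neq BA$; see also Figure 2.5 for an illustration.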
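
The summation in (2.13) translates directly into code. The sketch below is one possible plain-Python rendering with matrices stored as nested lists; the function name `matmul` is illustrative. It reproduces the two products of Example 2.3 and rejects operands whose neighboring dimensions do not match.

```python
def matmul(A, B):
    """Product of A (m x n) and B (n x k) as nested lists, following (2.13)."""
    n = len(B)
    if any(len(row) != n for row in A):
        raise ValueError("neighboring dimensions do not match")
    k = len(B[0])
    # c_ij is the sum over l of a_il * b_lj.
    return [[sum(A[i][l] * B[l][j] for l in range(n)) for j in range(k)]
            for i in range(len(A))]


A = [[1, 2, 3], [3, 2, 1]]
B = [[0, 2], [1, -1], [0, 1]]
AB = matmul(A, B)  # a 2 x 2 matrix
BA = matmul(B, A)  # a 3 x 3 matrix
print(AB)
print(BA)
```

Since `AB` and `BA` do not even have the same shape here, the two products cannot be equal.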


Definition 2.2 (Identity Matrix). In $\mathbb{R}^{n\times n}$, we define the identity matrix

$$
I_n := \begin{bmatrix}
1 & 0 & \cdots & 0 & \cdots & 0\\
0 & 1 & \cdots & 0 & \cdots & 0\\
\vdots & \vdots & \ddots & \vdots & & \vdots\\
0 & 0 & \cdots & 1 & \cdots & 0\\
\vdots & \vdots & & \vdots & \ddots & \vdots\\
0 & 0 & \cdots & 0 & \cdots & 1
\end{bmatrix} \in \mathbb{R}^{n\times n} \tag{2.17}
$$

as the $n \times n$-matrix containing 1 on the diagonal and 0 everywhere else.

[Margin note on (2.13): There are $n$ columns in $A$ and $n$ rows in $B$ so that we can compute $a_{il}b_{lj}$ for $l = 1, \ldots, n$. Commonly, the dot product between two vectors $a$, $b$ is denoted by $a^\top b$ or $\langle a, b\rangle$.]

[Figure 2.5: Even if both matrix multiplications $AB$ and $BA$ are defined, the dimensions of the results can be different.]


Now that we defined matrix multiplication, matrix addition and the 
identity matrix, let us have a look at some properties of matrices: 


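■ Associativity:
$$\forall A \in \mathbb{R}^{m\times n},\, B \in \mathbb{R}^{n\times p},\, C \in \mathbb{R}^{p\times q} : (AB)C = A(BC) \tag{2.18}$$

■ Distributivity:
$$\forall A, B \in \mathbb{R}^{m\times n},\, C, D \in \mathbb{R}^{n\times p} : (A + B)C = AC + BC \tag{2.19a}$$
$$A(C + D) = AC + AD \tag{2.19b}$$

■ Multiplication with the identity matrix:
$$\forall A \in \mathbb{R}^{m\times n} : I_m A = A I_n = A \tag{2.20}$$
Note that $I_m \neq I_n$ for $m \neq n$.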


2.2.2 Inverse and Transpose 


Definition 2.3 (Inverse). Consider a square matrix $A \in \mathbb{R}^{n\times n}$. Let matrix
$B \in \mathbb{R}^{n\times n}$ have the property that $AB = I_n = BA$. $B$ is called the
inverse of $A$ and denoted by $A^{-1}$.


Unfortunately, not every matrix $A$ possesses an inverse $A^{-1}$. If this
inverse does exist, $A$ is called regular/invertible/nonsingular, otherwise
singular/noninvertible. When the matrix inverse exists, it is unique. In
Section 2.3, we will discuss a general way to compute the inverse of a matrix
by solving a system of linear equations.


Remark (Existence of the Inverse of a 2 × 2-matrix). Consider a matrix

$$A := \begin{bmatrix} a_{11} & a_{12}\\ a_{21} & a_{22}\end{bmatrix} \in \mathbb{R}^{2\times 2}. \tag{2.21}$$

If we multiply $A$ with

$$A' := \begin{bmatrix} a_{22} & -a_{12}\\ -a_{21} & a_{11}\end{bmatrix} \tag{2.22}$$

we obtain

$$
AA' = \begin{bmatrix} a_{11}a_{22} - a_{12}a_{21} & 0\\ 0 & a_{11}a_{22} - a_{12}a_{21}\end{bmatrix}
= (a_{11}a_{22} - a_{12}a_{21})\, I. \tag{2.23}
$$

Therefore,

$$A^{-1} = \frac{1}{a_{11}a_{22} - a_{12}a_{21}} \begin{bmatrix} a_{22} & -a_{12}\\ -a_{21} & a_{11}\end{bmatrix} \tag{2.24}$$

if and only if $a_{11}a_{22} - a_{12}a_{21} \neq 0$. In Section 4.1, we will see that
$a_{11}a_{22} - a_{12}a_{21}$ is the determinant of a 2 × 2-matrix. Furthermore, we can generally
use the determinant to check whether a matrix is invertible. ♢
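
The closed form (2.24) is short enough to implement directly. The following sketch assumes a 2 × 2 matrix with integer or rational entries given as nested lists; the name `inverse_2x2` is illustrative. It returns `None` when the quantity $a_{11}a_{22} - a_{12}a_{21}$ vanishes.

```python
from fractions import Fraction


def inverse_2x2(A):
    """Inverse of a 2 x 2 matrix via (2.24), or None if A is singular."""
    (a11, a12), (a21, a22) = A
    det = a11 * a22 - a12 * a21
    if det == 0:
        return None  # (2.24) is only valid for a nonzero determinant
    f = Fraction(1) / det
    return [[a22 * f, -a12 * f], [-a21 * f, a11 * f]]


print(inverse_2x2([[1, 2], [3, 4]]))
print(inverse_2x2([[1, 2], [2, 4]]))
```

The second call returns `None` because the second row is a multiple of the first, so the determinant is zero.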


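Example 2.4 (Inverse Matrix)

The matrices

$$
A = \begin{bmatrix} 1 & 2 & 1\\ 4 & 4 & 5\\ 6 & 7 & 7\end{bmatrix}, \quad
B = \begin{bmatrix} -7 & -7 & 6\\ 2 & 1 & -1\\ 4 & 5 & -4\end{bmatrix} \tag{2.25}
$$

are inverse to each other since $AB = I = BA$.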


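Definition 2.4 (Transpose). For $A \in \mathbb{R}^{m\times n}$ the matrix $B \in \mathbb{R}^{n\times m}$ with
$b_{ij} = a_{ji}$ is called the transpose of $A$. We write $B = A^\top$.


In general, $A^\top$ can be obtained by writing the columns of $A$ as the rows
of $A^\top$. The following are important properties of inverses and transposes:

$$
\begin{aligned}
AA^{-1} &= I = A^{-1}A && (2.26)\\
(AB)^{-1} &= B^{-1}A^{-1} && (2.27)\\
(A + B)^{-1} &\neq A^{-1} + B^{-1} && (2.28)\\
(A^\top)^\top &= A && (2.29)\\
(A + B)^\top &= A^\top + B^\top && (2.30)\\
(AB)^\top &= B^\top A^\top && (2.31)
\end{aligned}
$$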


Definition 2.5 (Symmetric Matrix). A matrix $A \in \mathbb{R}^{n\times n}$ is symmetric if
$A = A^\top$.

Note that only $(n, n)$-matrices can be symmetric. Generally, we call
$(n, n)$-matrices also square matrices because they possess the same
number of rows and columns. Moreover, if $A$ is invertible, then so is $A^\top$, and
$(A^{-1})^\top = (A^\top)^{-1} =: A^{-\top}$.

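Remark (Sum and Product of Symmetric Matrices). The sum of symmetric
matrices $A, B \in \mathbb{R}^{n\times n}$ is always symmetric. However, although their
product is always defined, it is generally not symmetric:

$$
\begin{bmatrix} 1 & 0\\ 0 & 0\end{bmatrix}
\begin{bmatrix} 1 & 1\\ 1 & 1\end{bmatrix}
= \begin{bmatrix} 1 & 1\\ 0 & 0\end{bmatrix}. \tag{2.32}
$$

♢


2.2.3 Multiplication by a Scalar
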


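Let us look at what happens to matrices when they are multiplied by a
scalar $\lambda \in \mathbb{R}$. Let $A \in \mathbb{R}^{m\times n}$ and $\lambda \in \mathbb{R}$. Then $\lambda A = K$, $K_{ij} = \lambda a_{ij}$.
Practically, $\lambda$ scales each element of $A$. For $\lambda, \psi \in \mathbb{R}$, the following holds:

■ Associativity:
  $(\lambda\psi)C = \lambda(\psi C)$, $C \in \mathbb{R}^{m\times n}$
■ $\lambda(BC) = (\lambda B)C = B(\lambda C) = (BC)\lambda$, $B \in \mathbb{R}^{m\times n}$, $C \in \mathbb{R}^{n\times k}$.
  Note that this allows us to move scalar values around.
■ $(\lambda C)^\top = C^\top \lambda^\top = C^\top \lambda = \lambda C^\top$ since $\lambda = \lambda^\top$ for all $\lambda \in \mathbb{R}$.
■ Distributivity:
  $(\lambda + \psi)C = \lambda C + \psi C$, $C \in \mathbb{R}^{m\times n}$
  $\lambda(B + C) = \lambda B + \lambda C$, $B, C \in \mathbb{R}^{m\times n}$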


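Example 2.5 (Distributivity)
If we define

$$C := \begin{bmatrix} 1 & 2\\ 3 & 4\end{bmatrix}, \tag{2.33}$$

then for any $\lambda, \psi \in \mathbb{R}$ we obtain

$$
(\lambda + \psi)C
= \begin{bmatrix} (\lambda + \psi)1 & (\lambda + \psi)2\\ (\lambda + \psi)3 & (\lambda + \psi)4\end{bmatrix}
= \begin{bmatrix} \lambda + \psi & 2\lambda + 2\psi\\ 3\lambda + 3\psi & 4\lambda + 4\psi\end{bmatrix} \tag{2.34a}
$$

$$
= \begin{bmatrix} \lambda & 2\lambda\\ 3\lambda & 4\lambda\end{bmatrix}
+ \begin{bmatrix} \psi & 2\psi\\ 3\psi & 4\psi\end{bmatrix}
= \lambda C + \psi C. \tag{2.34b}
$$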


2.2.4 Compact Representations of Systems of Linear Equations 

If we consider the system of linear equations

$$
\begin{aligned}
2x_1 + 3x_2 + 5x_3 &= 1\\
4x_1 - 2x_2 - 7x_3 &= 8\\
9x_1 + 5x_2 - 3x_3 &= 2
\end{aligned}
\tag{2.35}
$$

and use the rules for matrix multiplication, we can write this equation
system in a more compact form as

$$
\begin{bmatrix} 2 & 3 & 5\\ 4 & -2 & -7\\ 9 & 5 & -3\end{bmatrix}
\begin{bmatrix} x_1\\ x_2\\ x_3\end{bmatrix}
= \begin{bmatrix} 1\\ 8\\ 2\end{bmatrix}. \tag{2.36}
$$

Note that $x_1$ scales the first column, $x_2$ the second one, and $x_3$ the third
one.

Generally, a system of linear equations can be compactly represented in
its matrix form as $Ax = b$; see (2.3), and the product $Ax$ is a (linear)
combination of the columns of $A$. We will discuss linear combinations in
more detail in Section 2.5.
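
To make the column view of $Ax$ concrete, the sketch below computes the product for the coefficient matrix of (2.36) in two ways: row by row, and as a sum of columns scaled by the entries of $x$. The function names `matvec` and `column_combination` and the test vector are illustrative choices, not taken from the text.

```python
def matvec(A, x):
    """Row view: entry i is the dot product of row i of A with x."""
    return [sum(a * xj for a, xj in zip(row, x)) for row in A]


def column_combination(A, x):
    """Column view: add up the columns of A, column j scaled by x[j]."""
    m = len(A)
    result = [0] * m
    for j, xj in enumerate(x):
        for i in range(m):
            result[i] += xj * A[i][j]
    return result


A = [[2, 3, 5], [4, -2, -7], [9, 5, -3]]
x = [1, 2, 3]  # an arbitrary vector chosen for illustration
print(matvec(A, x))
print(column_combination(A, x))
```

Both functions return the same vector, which is the statement that $Ax$ is a linear combination of the columns of $A$.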


2.3 Solving Systems of Linear Equations


In (2.3), we introduced the general form of an equation system, i.e.,

$$
\begin{aligned}
a_{11}x_1 + \cdots + a_{1n}x_n &= b_1\\
&\;\;\vdots\\
a_{m1}x_1 + \cdots + a_{mn}x_n &= b_m,
\end{aligned}
\tag{2.37}
$$

where $a_{ij} \in \mathbb{R}$ and $b_i \in \mathbb{R}$ are known constants and $x_j$ are unknowns,
$i = 1, \ldots, m$, $j = 1, \ldots, n$. Thus far, we saw that matrices can be used as
a compact way of formulating systems of linear equations so that we can
write $Ax = b$, see (2.10). Moreover, we defined basic matrix operations,
such as addition and multiplication of matrices. In the following, we will
focus on solving systems of linear equations and provide an algorithm for
finding the inverse of a matrix.


2.3.1 Particular and General Solution 


Before discussing how to generally solve systems of linear equations, let
us have a look at an example. Consider the system of equations

$$
\begin{bmatrix} 1 & 0 & 8 & -4\\ 0 & 1 & 2 & 12\end{bmatrix}
\begin{bmatrix} x_1\\ x_2\\ x_3\\ x_4\end{bmatrix}
= \begin{bmatrix} 42\\ 8\end{bmatrix}. \tag{2.38}
$$

The system has two equations and four unknowns. Therefore, in general
we would expect infinitely many solutions. This system of equations is
in a particularly easy form, where the first two columns consist of a 1
and a 0. Remember that we want to find scalars $x_1, \ldots, x_4$, such that
$\sum_{i=1}^{4} x_i c_i = b$, where we define $c_i$ to be the $i$th column of the matrix and
$b$ the right-hand side of (2.38). A solution to the problem in (2.38) can
be found immediately by taking 42 times the first column and 8 times the
second column so that

$$
b = \begin{bmatrix} 42\\ 8\end{bmatrix}
= 42 \begin{bmatrix} 1\\ 0\end{bmatrix} + 8 \begin{bmatrix} 0\\ 1\end{bmatrix}. \tag{2.39}
$$


Therefore, a solution is $[42, 8, 0, 0]^\top$. This solution is called a particular
solution or special solution. However, this is not the only solution of this
system of linear equations. To capture all the other solutions, we need
to be creative in generating $0$ in a non-trivial way using the columns of
the matrix: Adding $0$ to our special solution does not change the special
solution. To do so, we express the third column using the first two columns
(which are of this very simple form)

$$
\begin{bmatrix} 8\\ 2\end{bmatrix}
= 8 \begin{bmatrix} 1\\ 0\end{bmatrix} + 2 \begin{bmatrix} 0\\ 1\end{bmatrix} \tag{2.40}
$$

so that $0 = 8c_1 + 2c_2 - 1c_3 + 0c_4$ and $(x_1, x_2, x_3, x_4) = (8, 2, -1, 0)$. In
fact, any scaling of this solution by $\lambda_1 \in \mathbb{R}$ produces the $0$ vector, i.e.,

$$
\begin{bmatrix} 1 & 0 & 8 & -4\\ 0 & 1 & 2 & 12\end{bmatrix}
\left(\lambda_1 \begin{bmatrix} 8\\ 2\\ -1\\ 0\end{bmatrix}\right)
= \lambda_1 (8c_1 + 2c_2 - c_3) = 0. \tag{2.41}
$$

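Following the same line of reasoning, we express the fourth column of the
matrix in (2.38) using the first two columns and generate another set of
non-trivial versions of $0$ as

$$
\begin{bmatrix} 1 & 0 & 8 & -4\\ 0 & 1 & 2 & 12\end{bmatrix}
\left(\lambda_2 \begin{bmatrix} -4\\ 12\\ 0\\ -1\end{bmatrix}\right)
= \lambda_2 (-4c_1 + 12c_2 - c_4) = 0 \tag{2.42}
$$

for any $\lambda_2 \in \mathbb{R}$. Putting everything together, we obtain all solutions of the
equation system in (2.38), which is called the general solution, as the set

$$
\left\{ x \in \mathbb{R}^4 : x = \begin{bmatrix} 42\\ 8\\ 0\\ 0\end{bmatrix}
+ \lambda_1 \begin{bmatrix} 8\\ 2\\ -1\\ 0\end{bmatrix}
+ \lambda_2 \begin{bmatrix} -4\\ 12\\ 0\\ -1\end{bmatrix},\; \lambda_1, \lambda_2 \in \mathbb{R} \right\}. \tag{2.43}
$$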

Remark. The general approach we followed consisted of the following
three steps:


1. Find a particular solution to $Ax = b$.
2. Find all solutions to $Ax = 0$.
3. Combine the solutions from steps 1. and 2. to the general solution.


Neither the general nor the particular solution is unique. ♢
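
The three steps can be checked numerically for the system in (2.38). The sketch below stores the particular solution and the two solutions of $Ax = 0$ read off above, combines them for several values of $\lambda_1, \lambda_2$, and confirms that the result always solves $Ax = b$. The names `general_solution` and `matvec` are illustrative.

```python
A = [[1, 0, 8, -4], [0, 1, 2, 12]]
b = [42, 8]

particular = [42, 8, 0, 0]                    # step 1: one solution of Ax = b
kernel = [[8, 2, -1, 0], [-4, 12, 0, -1]]     # step 2: solutions of Ax = 0


def matvec(A, x):
    """Matrix-vector product with A stored as a list of rows."""
    return [sum(a * xj for a, xj in zip(row, x)) for row in A]


def general_solution(lam1, lam2):
    """Step 3: particular solution plus a combination of kernel vectors."""
    u, v = kernel
    return [p + lam1 * ui + lam2 * vi for p, ui, vi in zip(particular, u, v)]


for lam1, lam2 in [(0, 0), (1, 0), (0, 1), (-2, 3)]:
    print(lam1, lam2, matvec(A, general_solution(lam1, lam2)) == b)
```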


The system of linear equations in the preceding example was easy to 
solve because the matrix in (2.38) has this particularly convenient form, 
which allowed us to find the particular and the general solution by in- 
spection. However, general equation systems are not of this simple form. 
Fortunately, there exists a constructive algorithmic way of transforming 
any system of linear equations into this particularly simple form: Gaussian 
elimination. Key to Gaussian elimination are elementary transformations 
of systems of linear equations, which transform the equation system into 
a simple form. Then, we can apply the three steps to the simple form that 
we just discussed in the context of the example in (2.38). 


2.3.2 Elementary Transformations 


Key to solving a system of linear equations are elementary transformations 
that keep the solution set the same, but that transform the equation system 
into a simpler form: 

■ Exchange of two equations (rows in the matrix representing the system
  of equations)
■ Multiplication of an equation (row) with a constant $\lambda \in \mathbb{R}\setminus\{0\}$
■ Addition of two equations (rows)


Example 2.6 


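For $a \in \mathbb{R}$, we seek all solutions of the following system of equations:

$$
\begin{aligned}
-2x_1 + 4x_2 - 2x_3 - x_4 + 4x_5 &= -3\\
4x_1 - 8x_2 + 3x_3 - 3x_4 + x_5 &= 2\\
x_1 - 2x_2 + x_3 - x_4 + x_5 &= 0\\
x_1 - 2x_2 - 3x_4 + 4x_5 &= a
\end{aligned}
\tag{2.44}
$$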


We start by converting this system of equations into the compact matrix
notation $Ax = b$. We no longer mention the variables $x$ explicitly and
build the augmented matrix (in the form $[A \mid b]$)

$$
\left[\begin{array}{rrrrr|r}
-2 & 4 & -2 & -1 & 4 & -3\\
4 & -8 & 3 & -3 & 1 & 2\\
1 & -2 & 1 & -1 & 1 & 0\\
1 & -2 & 0 & -3 & 4 & a
\end{array}\right]
\begin{array}{l}
\text{Swap with } R_3\\
\\
\text{Swap with } R_1\\
\\
\end{array}
$$

where we used the vertical line to separate the left-hand side from the
right-hand side in (2.44). We use $\rightsquigarrow$ to indicate a transformation of the
augmented matrix using elementary transformations.


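Swapping Rows 1 and 3 leads to

$$
\left[\begin{array}{rrrrr|r}
1 & -2 & 1 & -1 & 1 & 0\\
4 & -8 & 3 & -3 & 1 & 2\\
-2 & 4 & -2 & -1 & 4 & -3\\
1 & -2 & 0 & -3 & 4 & a
\end{array}\right]
\begin{array}{l}
\\
-4R_1\\
+2R_1\\
-R_1
\end{array}
$$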


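When we now apply the indicated transformations (e.g., subtract Row 1
four times from Row 2), we obtain

$$
\begin{aligned}
&\left[\begin{array}{rrrrr|r}
1 & -2 & 1 & -1 & 1 & 0\\
0 & 0 & -1 & 1 & -3 & 2\\
0 & 0 & 0 & -3 & 6 & -3\\
0 & 0 & -1 & -2 & 3 & a
\end{array}\right]
\begin{array}{l} \\ \\ \\ -R_2 - R_3 \end{array}\\[1ex]
\rightsquigarrow\;
&\left[\begin{array}{rrrrr|r}
1 & -2 & 1 & -1 & 1 & 0\\
0 & 0 & -1 & 1 & -3 & 2\\
0 & 0 & 0 & -3 & 6 & -3\\
0 & 0 & 0 & 0 & 0 & a + 1
\end{array}\right]
\begin{array}{l} \\ \cdot(-1)\\ \cdot(-\tfrac{1}{3})\\ \\ \end{array}\\[1ex]
\rightsquigarrow\;
&\left[\begin{array}{rrrrr|r}
1 & -2 & 1 & -1 & 1 & 0\\
0 & 0 & 1 & -1 & 3 & -2\\
0 & 0 & 0 & 1 & -2 & 1\\
0 & 0 & 0 & 0 & 0 & a + 1
\end{array}\right]
\end{aligned}
$$

This (augmented) matrix is in a convenient form, the row-echelon form
(REF). Reverting this compact notation back into the explicit notation with
the variables we seek, we obtain

$$
\begin{aligned}
x_1 - 2x_2 + x_3 - x_4 + x_5 &= 0\\
x_3 - x_4 + 3x_5 &= -2\\
x_4 - 2x_5 &= 1\\
0 &= a + 1
\end{aligned}
\tag{2.45}
$$

Only for $a = -1$ this system can be solved. A particular solution is

$$
\begin{bmatrix} x_1\\ x_2\\ x_3\\ x_4\\ x_5\end{bmatrix}
= \begin{bmatrix} 2\\ 0\\ -1\\ 1\\ 0\end{bmatrix}. \tag{2.46}
$$

The general solution, which captures the set of all possible solutions, is

$$
\left\{ x \in \mathbb{R}^5 : x = \begin{bmatrix} 2\\ 0\\ -1\\ 1\\ 0\end{bmatrix}
+ \lambda_1 \begin{bmatrix} 2\\ 1\\ 0\\ 0\\ 0\end{bmatrix}
+ \lambda_2 \begin{bmatrix} 2\\ 0\\ -1\\ 2\\ 1\end{bmatrix},\; \lambda_1, \lambda_2 \in \mathbb{R} \right\}. \tag{2.47}
$$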

In the following, we will detail a constructive way to obtain a particular 
and general solution of a system of linear equations. 


Remark (Pivots and Staircase Structure). The leading coefficient of a row
(first nonzero number from the left) is called the pivot and is always
strictly to the right of the pivot of the row above it. Therefore, any
equation system in row-echelon form always has a “staircase” structure. ♢


Definition 2.6 (Row-Echelon Form). A matrix is in row-echelon form if 


■ All rows that contain only zeros are at the bottom of the matrix;
  correspondingly, all rows that contain at least one nonzero element are on
  top of rows that contain only zeros.

■ Looking at nonzero rows only, the first nonzero number from the left
  (also called the pivot or the leading coefficient) is always strictly to the
  right of the pivot of the row above it.


Remark (Basic and Free Variables). The variables corresponding to the
pivots in the row-echelon form are called basic variables and the other
variables are free variables. For example, in (2.45), $x_1, x_3, x_4$ are basic
variables, whereas $x_2, x_5$ are free variables. ♢


Remark (Obtaining a Particular Solution). The row-echelon form makes
our lives easier when we need to determine a particular solution. To do
this, we express the right-hand side of the equation system using the pivot
columns, such that $b = \sum_{i=1}^{P} \lambda_i p_i$, where $p_i$, $i = 1, \ldots, P$, are the pivot
columns. The $\lambda_i$ are determined easiest if we start with the rightmost pivot
column and work our way to the left.


In the previous example, we would try to find $\lambda_1, \lambda_2, \lambda_3$ so that

$$
\lambda_1 \begin{bmatrix} 1\\ 0\\ 0\\ 0\end{bmatrix}
+ \lambda_2 \begin{bmatrix} 1\\ 1\\ 0\\ 0\end{bmatrix}
+ \lambda_3 \begin{bmatrix} -1\\ -1\\ 1\\ 0\end{bmatrix}
= \begin{bmatrix} 0\\ -2\\ 1\\ 0\end{bmatrix}. \tag{2.48}
$$

From here, we find relatively directly that $\lambda_3 = 1$, $\lambda_2 = -1$, $\lambda_1 = 2$. When
we put everything together, we must not forget the non-pivot columns
for which we set the coefficients implicitly to 0. Therefore, we get the
particular solution $x = [2, 0, -1, 1, 0]^\top$. ♢

Remark (Reduced Row Echelon Form). An equation system is in reduced
row-echelon form (also: row-reduced echelon form or row canonical form) if


■ It is in row-echelon form.
■ Every pivot is 1.
■ The pivot is the only nonzero entry in its column.

♢


The reduced row-echelon form will play an important role later in Sec- 
tion 2.3.3 because it allows us to determine the general solution of a sys- 
tem of linear equations in a straightforward way. 


Remark (Gaussian Elimination). Gaussian elimination is an algorithm that
performs elementary transformations to bring a system of linear equations
into reduced row-echelon form. ♢
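
As a rough illustration of this remark, the following sketch brings a matrix into reduced row-echelon form using only the elementary transformations listed in Section 2.3.2. It is one possible implementation, not the text's; the function name `rref` is hypothetical, and exact fractions are used so that no rounding occurs. It is applied to the augmented matrix of Example 2.6 with $a = -1$.

```python
from fractions import Fraction


def rref(M):
    """Return (R, pivot_cols), where R is the reduced row-echelon form of M."""
    R = [[Fraction(v) for v in row] for row in M]
    rows, cols = len(R), len(R[0])
    pivot_cols = []
    r = 0  # index of the row that receives the next pivot
    for c in range(cols):
        # Look for a row at or below r with a nonzero entry in column c.
        p = next((i for i in range(r, rows) if R[i][c] != 0), None)
        if p is None:
            continue  # no pivot in this column
        R[r], R[p] = R[p], R[r]               # exchange two rows
        pivot = R[r][c]
        R[r] = [v / pivot for v in R[r]]      # scale the row so the pivot is 1
        for i in range(rows):                 # clear the rest of the column
            if i != r and R[i][c] != 0:
                f = R[i][c]
                R[i] = [vi - f * vr for vi, vr in zip(R[i], R[r])]
        pivot_cols.append(c)
        r += 1
        if r == rows:
            break
    return R, pivot_cols


augmented = [
    [-2, 4, -2, -1, 4, -3],
    [4, -8, 3, -3, 1, 2],
    [1, -2, 1, -1, 1, 0],
    [1, -2, 0, -3, 4, -1],   # right-hand side a = -1
]
R, pivots = rref(augmented)
for row in R:
    print([str(v) for v in row])
print(pivots)
```

The pivot columns returned correspond to the basic variables $x_1, x_3, x_4$. If the right-hand side column itself becomes a pivot column, the system has no solution.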


Example 2.7 (Reduced Row Echelon Form)
Verify that the following matrix is in reduced row-echelon form (the pivots
are in bold):

$$
A = \begin{bmatrix}
\mathbf{1} & 3 & 0 & 0 & 3\\
0 & 0 & \mathbf{1} & 0 & 9\\
0 & 0 & 0 & \mathbf{1} & -4
\end{bmatrix}. \tag{2.49}
$$

The key idea for finding the solutions of $Ax = 0$ is to look at the
non-pivot columns, which we will need to express as a (linear) combination of
the pivot columns. The reduced row echelon form makes this relatively
straightforward, and we express the non-pivot columns in terms of sums
and multiples of the pivot columns that are on their left: The second
column is 3 times the first column (we can ignore the pivot columns on the
right of the second column). Therefore, to obtain $0$, we need to subtract
the second column from three times the first column. Now, we look at the
fifth column, which is our second non-pivot column. The fifth column can
be expressed as 3 times the first pivot column, 9 times the second pivot
column, and −4 times the third pivot column. We need to keep track of
the indices of the pivot columns and translate this into 3 times the first
column, 0 times the second column (which is a non-pivot column), 9 times
the third column (which is our second pivot column), and −4 times the
fourth column (which is the third pivot column). Then we need to subtract
the fifth column to obtain $0$. In the end, we are still solving a homogeneous
equation system.
To summarize, all solutions of $Ax = 0$, $x \in \mathbb{R}^5$ are given by

$$
\left\{ x \in \mathbb{R}^5 : x = \lambda_1 \begin{bmatrix} 3\\ -1\\ 0\\ 0\\ 0\end{bmatrix}
+ \lambda_2 \begin{bmatrix} 3\\ 0\\ 9\\ -4\\ -1\end{bmatrix},\; \lambda_1, \lambda_2 \in \mathbb{R} \right\}. \tag{2.50}
$$

2.3.3 The Minus-1 Trick 


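In the following, we introduce a practical trick for reading out the
solutions $x$ of a homogeneous system of linear equations $Ax = 0$, where
$A \in \mathbb{R}^{k\times n}$, $x \in \mathbb{R}^n$.

To start, we assume that $A$ is in reduced row-echelon form without any
rows that just contain zeros, i.e.,

$$
A = \begin{bmatrix}
0 & \cdots & 0 & \mathbf{1} & * & \cdots & * & 0 & * & \cdots & * & 0 & * & \cdots & *\\
\vdots & & \vdots & 0 & 0 & \cdots & 0 & \mathbf{1} & * & \cdots & * & \vdots & \vdots & & \vdots\\
\vdots & & \vdots & \vdots & \vdots & & \vdots & 0 & \vdots & & \vdots & \vdots & \vdots & & \vdots\\
\vdots & & \vdots & \vdots & \vdots & & \vdots & \vdots & \vdots & & \vdots & 0 & \vdots & & \vdots\\
0 & \cdots & 0 & 0 & 0 & \cdots & 0 & 0 & 0 & \cdots & 0 & \mathbf{1} & * & \cdots & *
\end{bmatrix}, \tag{2.51}
$$

where $*$ can be an arbitrary real number, with the constraints that the first
nonzero entry per row must be 1 and all other entries in the corresponding
column must be 0. The columns $j_1, \ldots, j_k$ with the pivots (marked in
bold) are the standard unit vectors $e_1, \ldots, e_k \in \mathbb{R}^k$. We extend this matrix
to an $n \times n$-matrix $\tilde{A}$ by adding $n - k$ rows of the form

$$\begin{bmatrix} 0 & \cdots & 0 & -1 & 0 & \cdots & 0\end{bmatrix} \tag{2.52}$$

so that the diagonal of the augmented matrix $\tilde{A}$ contains either 1 or −1.
Then, the columns of $\tilde{A}$ that contain the −1 as pivots are solutions of
the homogeneous equation system $Ax = 0$. To be more precise, these
columns form a basis (Section 2.6.1) of the solution space of $Ax = 0$,
which we will later call the kernel or null space (see Section 2.7.3).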


Example 2.8 (Minus-1 Trick) 
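Let us revisit the matrix in (2.49), which is already in reduced REF:

$$
A = \begin{bmatrix}
1 & 3 & 0 & 0 & 3\\
0 & 0 & 1 & 0 & 9\\
0 & 0 & 0 & 1 & -4
\end{bmatrix}. \tag{2.53}
$$

We now augment this matrix to a 5 × 5 matrix by adding rows of the
form (2.52) at the places where the pivots on the diagonal are missing
and obtain

$$
\tilde{A} = \begin{bmatrix}
1 & 3 & 0 & 0 & 3\\
0 & -1 & 0 & 0 & 0\\
0 & 0 & 1 & 0 & 9\\
0 & 0 & 0 & 1 & -4\\
0 & 0 & 0 & 0 & -1
\end{bmatrix}. \tag{2.54}
$$

From this form, we can immediately read out the solutions of $Ax = 0$ by
taking the columns of $\tilde{A}$, which contain −1 on the diagonal:

$$
\left\{ x \in \mathbb{R}^5 : x = \lambda_1 \begin{bmatrix} 3\\ -1\\ 0\\ 0\\ 0\end{bmatrix}
+ \lambda_2 \begin{bmatrix} 3\\ 0\\ 9\\ -4\\ -1\end{bmatrix},\; \lambda_1, \lambda_2 \in \mathbb{R} \right\}, \tag{2.55}
$$

which is identical to the solution in (2.50) that we obtained by “insight”.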
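
The Minus-1 Trick is mechanical enough to write down as a short routine. The sketch below assumes its input is already in reduced row-echelon form with no zero rows, as required above; the function name `minus_one_trick` is illustrative. It inserts the rows of the form (2.52) and returns the columns that carry −1 on the diagonal.

```python
def minus_one_trick(R):
    """Basis of the solutions of Rx = 0 for R in reduced row-echelon form.

    Assumes R (a list of rows) has no rows that contain only zeros.
    """
    n = len(R[0])
    # Pivot column of each row: the index of its first nonzero entry.
    pivots = [next(j for j, v in enumerate(row) if v != 0) for row in R]
    rows = iter(R)
    extended = []
    for j in range(n):
        if j in pivots:
            extended.append(list(next(rows)))  # keep the row with pivot in column j
        else:
            # Insert a row of the form (2.52) with -1 on the diagonal.
            extended.append([-1 if c == j else 0 for c in range(n)])
    # The columns with -1 on the diagonal are the non-pivot columns.
    return [[extended[i][j] for i in range(n)] for j in range(n) if j not in pivots]


A = [[1, 3, 0, 0, 3],
     [0, 0, 1, 0, 9],
     [0, 0, 0, 1, -4]]
for vector in minus_one_trick(A):
    print(vector)
```

Applied to the matrix in (2.53), this returns the two vectors that appear in (2.55).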


Calculating the Inverse 


To compute the inverse $A^{-1}$ of $A \in \mathbb{R}^{n\times n}$, we need to find a matrix $X$
that satisfies $AX = I_n$. Then, $X = A^{-1}$. We can write this down as
a set of simultaneous linear equations $AX = I_n$, where we solve for
$X = [x_1 | \cdots | x_n]$. We use the augmented matrix notation for a compact
representation of this set of systems of linear equations and obtain

$$[A \mid I_n] \;\rightsquigarrow \cdots \rightsquigarrow\; [I_n \mid A^{-1}]. \tag{2.56}$$

This means that if we bring the augmented equation system into reduced
row-echelon form, we can read out the inverse on the right-hand side of
the equation system. Hence, determining the inverse of a matrix is
equivalent to solving systems of linear equations.


Example 2.9 (Calculating an Inverse Matrix by Gaussian Elimination)
To determine the inverse of

$$
A = \begin{bmatrix}
1 & 0 & 2 & 0\\
1 & 1 & 0 & 0\\
1 & 2 & 0 & 1\\
1 & 1 & 1 & 1
\end{bmatrix} \tag{2.57}
$$

we write down the augmented matrix

$$
\left[\begin{array}{rrrr|rrrr}
1 & 0 & 2 & 0 & 1 & 0 & 0 & 0\\
1 & 1 & 0 & 0 & 0 & 1 & 0 & 0\\
1 & 2 & 0 & 1 & 0 & 0 & 1 & 0\\
1 & 1 & 1 & 1 & 0 & 0 & 0 & 1
\end{array}\right]
$$

and use Gaussian elimination to bring it into reduced row-echelon form

$$
\left[\begin{array}{rrrr|rrrr}
1 & 0 & 0 & 0 & -1 & 2 & -2 & 2\\
0 & 1 & 0 & 0 & 1 & -1 & 2 & -2\\
0 & 0 & 1 & 0 & 1 & -1 & 1 & -1\\
0 & 0 & 0 & 1 & -1 & 0 & -1 & 2
\end{array}\right],
$$

such that the desired inverse is given as its right-hand side:

$$
A^{-1} = \begin{bmatrix}
-1 & 2 & -2 & 2\\
1 & -1 & 2 & -2\\
1 & -1 & 1 & -1\\
-1 & 0 & -1 & 2
\end{bmatrix}. \tag{2.58}
$$

We can verify that (2.58) is indeed the inverse by performing the
multiplication $AA^{-1}$ and observing that we recover $I_4$.
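
The procedure in (2.56) can be sketched as follows: build $[A \mid I_n]$, reduce the left block to the identity with elementary row operations, and read off the right block. This is an illustrative implementation under the assumption of exact rational arithmetic; the function name `inverse` is hypothetical. It is applied to the matrix of Example 2.9.

```python
from fractions import Fraction


def inverse(A):
    """Inverse of a square matrix via [A | I] ~> [I | A^-1], or None if singular."""
    n = len(A)
    # Augmented matrix [A | I_n] with exact rational entries.
    M = [[Fraction(v) for v in row] + [Fraction(int(i == j)) for j in range(n)]
         for i, row in enumerate(A)]
    for c in range(n):
        p = next((i for i in range(c, n) if M[i][c] != 0), None)
        if p is None:
            return None  # no pivot in column c: A is not invertible
        M[c], M[p] = M[p], M[c]
        pivot = M[c][c]
        M[c] = [v / pivot for v in M[c]]
        for i in range(n):
            if i != c:
                f = M[i][c]
                M[i] = [vi - f * vc for vi, vc in zip(M[i], M[c])]
    return [row[n:] for row in M]


A = [[1, 0, 2, 0],
     [1, 1, 0, 0],
     [1, 2, 0, 1],
     [1, 1, 1, 1]]
for row in inverse(A):
    print([str(v) for v in row])
```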


2.3.4 Algorithms for Solving a System of Linear Equations 


In the following, we briefly discuss approaches to solving a system of
linear equations of the form $Ax = b$. We make the assumption that a
solution exists. Should there be no solution, we need to resort to approximate
solutions, which we do not cover in this chapter. One way to solve the
approximate problem is using the approach of linear regression, which we
discuss in detail in Chapter 9.

In special cases, we may be able to determine the inverse $A^{-1}$, such
that the solution of $Ax = b$ is given as $x = A^{-1}b$. However, this is
only possible if $A$ is a square matrix and invertible, which is often not the
case. Otherwise, under mild assumptions (i.e., $A$ needs to have linearly
independent columns) we can use the transformation

$$Ax = b \iff A^\top A x = A^\top b \iff x = (A^\top A)^{-1} A^\top b \tag{2.59}$$

and use the Moore-Penrose pseudo-inverse $(A^\top A)^{-1}A^\top$ to determine the
solution (2.59) that solves $Ax = b$, which also corresponds to the
minimum norm least-squares solution. A disadvantage of this approach is that
it requires many computations for the matrix-matrix product and
computing the inverse of $A^\top A$. Moreover, for reasons of numerical precision it
is generally not recommended to compute the inverse or pseudo-inverse.
In the following, we therefore briefly discuss alternative approaches to
solving systems of linear equations.
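
For a matrix with only two columns, the transformation in (2.59) can be carried out by hand, because $A^\top A$ is then a 2 × 2 matrix whose inverse has the closed form (2.24). The sketch below does exactly that for integer inputs. The function name `solve_normal_equations` and the example data are illustrative assumptions, and the approach is meant to show the formula, not to be a numerically recommended solver.

```python
from fractions import Fraction


def solve_normal_equations(A, b):
    """Solve A^T A x = A^T b for an integer matrix A with exactly two columns."""
    # Entries of the symmetric 2 x 2 matrix A^T A and of the vector A^T b.
    g11 = sum(r[0] * r[0] for r in A)
    g12 = sum(r[0] * r[1] for r in A)
    g22 = sum(r[1] * r[1] for r in A)
    h1 = sum(r[0] * bi for r, bi in zip(A, b))
    h2 = sum(r[1] * bi for r, bi in zip(A, b))
    det = g11 * g22 - g12 * g12
    if det == 0:
        raise ValueError("columns of A are linearly dependent")
    # Apply the closed-form 2 x 2 inverse (2.24) to A^T b.
    return [Fraction(g22 * h1 - g12 * h2, det), Fraction(g11 * h2 - g12 * h1, det)]


A = [[1, 0], [0, 1], [1, 1]]   # three equations, two unknowns
b = [1, 1, 2]                  # chosen so that an exact solution exists
print(solve_normal_equations(A, b))
```

Here the system has the exact solution $x = [1, 1]^\top$, which the formula recovers.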

Gaussian elimination plays an important role when computing deter- 
minants (Section 4.1), checking whether a set of vectors is linearly inde- 
pendent (Section 2.5), computing the inverse of a matrix (Section 2.2.2), 
computing the rank of a matrix (Section 2.6.2), and determining a basis 
of a vector space (Section 2.6.1). Gaussian elimination is an intuitive and 
constructive way to solve a system of linear equations with thousands of 
variables. However, for systems with millions of variables, it is impracti- 
cal as the required number of arithmetic operations scales cubically in the 
number of simultaneous equations. 

In practice, systems of many linear equations are solved indirectly, by
either stationary iterative methods, such as the Richardson method, the
Jacobi method, the Gauß-Seidel method, and the successive over-relaxation
method, or Krylov subspace methods, such as conjugate gradients,
generalized minimal residual, or biconjugate gradients. We refer to the books
by Stoer and Bulirsch (2002), Strang (2003), and Liesen and Mehrmann
(2015) for further details.

Let $x_*$ be a solution of $Ax = b$. The key idea of these iterative methods
is to set up an iteration of the form

$$x^{(k+1)} = Cx^{(k)} + d \tag{2.60}$$

for suitable $C$ and $d$ that reduces the residual error $\|x^{(k+1)} - x_*\|$ in every
iteration and converges to $x_*$. We will introduce norms $\|\cdot\|$, which allow
us to compute similarities between vectors, in Section 3.1.
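
As one concrete instance of the iteration (2.60), the Jacobi method mentioned above uses $C = -D^{-1}(A - D)$ and $d = D^{-1}b$, where $D$ is the diagonal part of $A$. The sketch below is a minimal version assuming nonzero diagonal entries; the function name `jacobi`, the fixed iteration count, and the small example system are illustrative choices. Convergence is not guaranteed in general; the example matrix is strictly diagonally dominant, a standard sufficient condition.

```python
def jacobi(A, b, iterations=100):
    """Run x <- Cx + d with the Jacobi choice of C and d, starting from x = 0."""
    n = len(A)
    # C = -D^{-1}(A - D) and d = D^{-1} b, where D is the diagonal of A.
    C = [[0.0 if i == j else -A[i][j] / A[i][i] for j in range(n)]
         for i in range(n)]
    d = [b[i] / A[i][i] for i in range(n)]
    x = [0.0] * n
    for _ in range(iterations):
        x = [sum(C[i][j] * x[j] for j in range(n)) + d[i] for i in range(n)]
    return x


# 4*x1 + x2 = 6 and x1 + 3*x2 = 7 have the exact solution x = (1, 2).
print(jacobi([[4, 1], [1, 3]], [6, 7]))
```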


2.4 Vector Spaces 


Thus far, we have looked at systems of linear equations and how to solve 
them (Section 2.3). We saw that systems of linear equations can be com- 
pactly represented using matrix-vector notation (2.10). In the following, 
we will have a closer look at vector spaces, i.e., a structured space in which 
vectors live. 

In the beginning of this chapter, we informally characterized vectors as 
objects that can be added together and multiplied by a scalar, and they 
remain objects of the same type. Now, we are ready to formalize this, 
and we will start by introducing the concept of a group, which is a set 
of elements and an operation defined on these elements that keeps some 
structure of the set intact. 


2.4.1 Groups


Groups play an important role in computer science. Besides providing a 
fundamental framework for operations on sets, they are heavily used in 
cryptography, coding theory, and graphics. 


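Definition 2.7 (Group). Consider a set $\mathcal{G}$ and an operation $\otimes : \mathcal{G} \times \mathcal{G} \to \mathcal{G}$
defined on $\mathcal{G}$. Then $G := (\mathcal{G}, \otimes)$ is called a group if the following hold:


1. Closure of $\mathcal{G}$ under $\otimes$: $\forall x, y \in \mathcal{G} : x \otimes y \in \mathcal{G}$
2. Associativity: $\forall x, y, z \in \mathcal{G} : (x \otimes y) \otimes z = x \otimes (y \otimes z)$
3. Neutral element: $\exists e \in \mathcal{G}\; \forall x \in \mathcal{G} : x \otimes e = x$ and $e \otimes x = x$
4. Inverse element: $\forall x \in \mathcal{G}\; \exists y \in \mathcal{G} : x \otimes y = e$ and $y \otimes x = e$, where $e$ is
   the neutral element. We often write $x^{-1}$ to denote the inverse element
   of $x$.


Remark. The inverse element is defined with respect to the operation $\otimes$
and does not necessarily mean $\frac{1}{x}$. ♢


If additionally $\forall x, y \in \mathcal{G} : x \otimes y = y \otimes x$, then $G = (\mathcal{G}, \otimes)$ is an Abelian
group (commutative).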
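
For a finite set, the four conditions of Definition 2.7 can be checked by brute force. The sketch below is an illustration only: the function name `is_group` and the examples (integers modulo 5 under addition and multiplication) are not taken from the text. The infinite sets discussed in this section cannot be checked this way.

```python
from itertools import product


def is_group(elements, op):
    """Check closure, associativity, neutral element, and inverses on a finite set."""
    elements = list(elements)
    # 1. Closure: every result must lie in the set again.
    if not all(op(x, y) in elements for x, y in product(elements, repeat=2)):
        return False
    # 2. Associativity.
    if not all(op(op(x, y), z) == op(x, op(y, z))
               for x, y, z in product(elements, repeat=3)):
        return False
    # 3. Neutral element.
    neutral = [e for e in elements
               if all(op(x, e) == x and op(e, x) == x for x in elements)]
    if not neutral:
        return False
    e = neutral[0]
    # 4. Every element needs an inverse with respect to op.
    return all(any(op(x, y) == e and op(y, x) == e for y in elements)
               for x in elements)


print(is_group(range(5), lambda x, y: (x + y) % 5))
print(is_group(range(5), lambda x, y: (x * y) % 5))
print(is_group(range(1, 5), lambda x, y: (x * y) % 5))
```

The second check fails because 0 has no multiplicative inverse; removing 0 from the set repairs this.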


Example 2.10 (Groups) 
Let us have a look at some examples of sets with associated operations 
and see whether they are groups: 


■ $(\mathbb{Z}, +)$ is an Abelian group.

■ $(\mathbb{N}_0, +)$ is not a group: Although $(\mathbb{N}_0, +)$ possesses a neutral element
  (0), the inverse elements are missing.

■ $(\mathbb{Z}, \cdot)$ is not a group: Although $(\mathbb{Z}, \cdot)$ contains a neutral element (1), the
  inverse elements for any $z \in \mathbb{Z}$, $z \neq \pm 1$, are missing.

■ $(\mathbb{R}, \cdot)$ is not a group since 0 does not possess an inverse element.

■ $(\mathbb{R}\setminus\{0\}, \cdot)$ is Abelian.

■ $(\mathbb{R}^n, +)$, $(\mathbb{Z}^n, +)$, $n \in \mathbb{N}$ are Abelian if + is defined componentwise, i.e.,

  $$(x_1, \cdots, x_n) + (y_1, \cdots, y_n) = (x_1 + y_1, \cdots, x_n + y_n). \tag{2.61}$$

  Then, $(x_1, \cdots, x_n)^{-1} := (-x_1, \cdots, -x_n)$ is the inverse element and
  $e = (0, \cdots, 0)$ is the neutral element.

■ $(\mathbb{R}^{m\times n}, +)$, the set of $m \times n$-matrices is Abelian (with componentwise
  addition as defined in (2.61)).

■ Let us have a closer look at $(\mathbb{R}^{n\times n}, \cdot)$, i.e., the set of $n \times n$-matrices with
  matrix multiplication as defined in (2.13).

  – Closure and associativity follow directly from the definition of matrix
    multiplication.

  – Neutral element: The identity matrix $I_n$ is the neutral element with
    respect to matrix multiplication “$\cdot$” in $(\mathbb{R}^{n\times n}, \cdot)$.

  – Inverse element: If the inverse exists ($A$ is regular), then $A^{-1}$ is the
    inverse element of $A \in \mathbb{R}^{n\times n}$, and in exactly this case $(\mathbb{R}^{n\times n}, \cdot)$ is a
    group, called the general linear group.


Definition 2.8 (General Linear Group). The set of regular (invertible)
matrices $A \in \mathbb{R}^{n\times n}$ is a group with respect to matrix multiplication as
defined in (2.13) and is called general linear group $GL(n, \mathbb{R})$. However,
since matrix multiplication is not commutative, the group is not Abelian.


2.4.2 Vector Spaces 


When we discussed groups, we looked at sets $\mathcal{G}$ and inner operations on
$\mathcal{G}$, i.e., mappings $\mathcal{G} \times \mathcal{G} \to \mathcal{G}$ that only operate on elements in $\mathcal{G}$. In the
following, we will consider sets that in addition to an inner operation +
also contain an outer operation $\cdot$, the multiplication of a vector $x \in \mathcal{G}$ by
a scalar $\lambda \in \mathbb{R}$. We can think of the inner operation as a form of addition,
and the outer operation as a form of scaling. Note that the inner/outer
operations have nothing to do with inner/outer products.


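Definition 2.9 (Vector Space). A real-valued vector space $V = (\mathcal{V}, +, \cdot)$ is
a set $\mathcal{V}$ with two operations

$$+ : \mathcal{V} \times \mathcal{V} \to \mathcal{V} \tag{2.62}$$
$$\cdot : \mathbb{R} \times \mathcal{V} \to \mathcal{V} \tag{2.63}$$

where

1. $(\mathcal{V}, +)$ is an Abelian group
2. Distributivity:
   1. $\forall \lambda \in \mathbb{R},\ \boldsymbol{x}, \boldsymbol{y} \in \mathcal{V} : \lambda \cdot (\boldsymbol{x} + \boldsymbol{y}) = \lambda \cdot \boldsymbol{x} + \lambda \cdot \boldsymbol{y}$
   2. $\forall \lambda, \psi \in \mathbb{R},\ \boldsymbol{x} \in \mathcal{V} : (\lambda + \psi) \cdot \boldsymbol{x} = \lambda \cdot \boldsymbol{x} + \psi \cdot \boldsymbol{x}$
3. Associativity (outer operation): $\forall \lambda, \psi \in \mathbb{R},\ \boldsymbol{x} \in \mathcal{V} : \lambda \cdot (\psi \cdot \boldsymbol{x}) = (\lambda \psi) \cdot \boldsymbol{x}$
4. Neutral element with respect to the outer operation: $\forall \boldsymbol{x} \in \mathcal{V} : 1 \cdot \boldsymbol{x} = \boldsymbol{x}$

The elements $\boldsymbol{x} \in V$ are called vectors. The neutral element of $(\mathcal{V}, +)$ is
the zero vector $\boldsymbol{0} = [0, \ldots, 0]^\top$, and the inner operation $+$ is called vector
addition. The elements $\lambda \in \mathbb{R}$ are called scalars and the outer operation
$\cdot$ is a multiplication by scalars. Note that a scalar product is something
different, and we will get to this in Section 3.2.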


Remark. A "vector multiplication" $\boldsymbol{a}\boldsymbol{b}$, $\boldsymbol{a}, \boldsymbol{b} \in \mathbb{R}^n$, is not defined. Theoretically,
we could define an element-wise multiplication, such that $\boldsymbol{c} = \boldsymbol{a}\boldsymbol{b}$
with $c_j = a_j b_j$. This "array multiplication" is common to many programming
languages but makes mathematically limited sense using the standard
rules for matrix multiplication: By treating vectors as $n \times 1$ matrices
(which we usually do), we can use the matrix multiplication as defined
in (2.13). However, then the dimensions of the vectors do not match. Only
the following multiplications for vectors are defined: $\boldsymbol{a}\boldsymbol{b}^\top \in \mathbb{R}^{n \times n}$ (outer
product), $\boldsymbol{a}^\top \boldsymbol{b} \in \mathbb{R}$ (inner/scalar/dot product). ♢
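
To make the three products in the remark concrete, here is a small sketch in plain Python. The vectors `a` and `b` are illustrative values that do not come from the text; the element-wise product is the "array multiplication" found in many programming languages, while the other two follow from matrix multiplication.

```python
# Illustrative vectors in R^3 (not taken from the text).
a = [1, 2, 3]
b = [4, 5, 6]

# "Array multiplication": c_j = a_j * b_j, not a matrix product.
elementwise = [aj * bj for aj, bj in zip(a, b)]

# Inner product a^T b: a (1 x n) times (n x 1) product, i.e. a scalar.
inner = sum(aj * bj for aj, bj in zip(a, b))

# Outer product a b^T: an (n x 1) times (1 x n) product, i.e. an n x n matrix.
outer = [[ai * bj for bj in b] for ai in a]

print(elementwise)  # [4, 10, 18]
print(inner)        # 32
print(outer)        # [[4, 5, 6], [8, 10, 12], [12, 15, 18]]
```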


Example 2.11 (Vector Spaces) 
Let us have a look at some important examples: 


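- $\mathcal{V} = \mathbb{R}^n$, $n \in \mathbb{N}$, is a vector space with operations defined as follows:

  - Addition: $\boldsymbol{x} + \boldsymbol{y} = (x_1, \ldots, x_n) + (y_1, \ldots, y_n) = (x_1 + y_1, \ldots, x_n + y_n)$
    for all $\boldsymbol{x}, \boldsymbol{y} \in \mathbb{R}^n$

  - Multiplication by scalars: $\lambda \boldsymbol{x} = \lambda (x_1, \ldots, x_n) = (\lambda x_1, \ldots, \lambda x_n)$ for
    all $\lambda \in \mathbb{R}$, $\boldsymbol{x} \in \mathbb{R}^n$

- $\mathcal{V} = \mathbb{R}^{m \times n}$, $m, n \in \mathbb{N}$, is a vector space with

  - Addition:
    $$\boldsymbol{A} + \boldsymbol{B} = \begin{bmatrix} a_{11} + b_{11} & \cdots & a_{1n} + b_{1n} \\ \vdots & & \vdots \\ a_{m1} + b_{m1} & \cdots & a_{mn} + b_{mn} \end{bmatrix}$$
    is defined elementwise for all $\boldsymbol{A}, \boldsymbol{B} \in \mathcal{V}$

  - Multiplication by scalars:
    $$\lambda \boldsymbol{A} = \begin{bmatrix} \lambda a_{11} & \cdots & \lambda a_{1n} \\ \vdots & & \vdots \\ \lambda a_{m1} & \cdots & \lambda a_{mn} \end{bmatrix}$$
    as defined in Section 2.2. Remember that $\mathbb{R}^{m \times n}$ is equivalent to $\mathbb{R}^{mn}$.

- $\mathcal{V} = \mathbb{C}$, with the standard definition of addition of complex numbers.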


Remark. In the following, we will denote a vector space $(\mathcal{V}, +, \cdot)$ by $V$
when $+$ and $\cdot$ are the standard vector addition and scalar multiplication.
Moreover, we will use the notation $\boldsymbol{x} \in V$ for vectors in $\mathcal{V}$ to simplify
notation. ♢


Remark. The vector spaces $\mathbb{R}^n$, $\mathbb{R}^{n \times 1}$, $\mathbb{R}^{1 \times n}$ are only different in the way
we write vectors. In the following, we will not make a distinction between
$\mathbb{R}^n$ and $\mathbb{R}^{n \times 1}$, which allows us to write $n$-tuples as column vectors

$$\boldsymbol{x} = \begin{bmatrix} x_1 \\ \vdots \\ x_n \end{bmatrix}. \tag{2.64}$$

This simplifies the notation regarding vector space operations. However,
we do distinguish between $\mathbb{R}^{n \times 1}$ and $\mathbb{R}^{1 \times n}$ (the row vectors) to avoid
confusion with matrix multiplication. By default, we write $\boldsymbol{x}$ to denote a column
vector, and a row vector is denoted by $\boldsymbol{x}^\top$, the transpose of $\boldsymbol{x}$. ♢
2.4.3 Vector Subspaces


In the following, we will introduce vector subspaces. Intuitively, they are 
sets contained in the original vector space with the property that when 
we perform vector space operations on elements within this subspace, we 
will never leave it. In this sense, they are “closed”. Vector subspaces are a 
key idea in machine learning. For example, Chapter 10 demonstrates how 
to use vector subspaces for dimensionality reduction. 


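Definition 2.10 (Vector Subspace). Let $V = (\mathcal{V}, +, \cdot)$ be a vector space
and $\mathcal{U} \subseteq \mathcal{V}$, $\mathcal{U} \neq \emptyset$. Then $U = (\mathcal{U}, +, \cdot)$ is called vector subspace of $V$ (or
linear subspace) if $U$ is a vector space with the vector space operations $+$
and $\cdot$ restricted to $\mathcal{U} \times \mathcal{U}$ and $\mathbb{R} \times \mathcal{U}$. We write $U \subseteq V$ to denote a subspace
$U$ of $V$.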


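If $\mathcal{U} \subseteq \mathcal{V}$ and $V$ is a vector space, then $U$ naturally inherits many properties
directly from $V$ because they hold for all $\boldsymbol{x} \in \mathcal{V}$, and in particular for
all $\boldsymbol{x} \in \mathcal{U} \subseteq \mathcal{V}$. This includes the Abelian group properties, the distributivity,
the associativity and the neutral element. To determine whether
$(\mathcal{U}, +, \cdot)$ is a subspace of $V$ we still do need to show

1. $\mathcal{U} \neq \emptyset$, in particular: $\boldsymbol{0} \in \mathcal{U}$
2. Closure of $U$:
   a. With respect to the outer operation: $\forall \lambda \in \mathbb{R}\ \forall \boldsymbol{x} \in \mathcal{U} : \lambda \boldsymbol{x} \in \mathcal{U}$.
   b. With respect to the inner operation: $\forall \boldsymbol{x}, \boldsymbol{y} \in \mathcal{U} : \boldsymbol{x} + \boldsymbol{y} \in \mathcal{U}$.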


Example 2.12 (Vector Subspaces) 
Let us have a look at some examples: 


- For every vector space $V$, the trivial subspaces are $V$ itself and $\{\boldsymbol{0}\}$.

- Only example D in Figure 2.6 is a subspace of $\mathbb{R}^2$ (with the usual inner/
outer operations). In A and C, the closure property is violated; B does
not contain $\boldsymbol{0}$.

- The solution set of a homogeneous system of linear equations $\boldsymbol{A}\boldsymbol{x} = \boldsymbol{0}$
with $n$ unknowns $\boldsymbol{x} = [x_1, \ldots, x_n]^\top$ is a subspace of $\mathbb{R}^n$.

- The solution set of an inhomogeneous system of linear equations $\boldsymbol{A}\boldsymbol{x} = \boldsymbol{b}$,
$\boldsymbol{b} \neq \boldsymbol{0}$, is not a subspace of $\mathbb{R}^n$.

- The intersection of arbitrarily many subspaces is a subspace itself.

Remark. Every subspace $U \subseteq (\mathbb{R}^n, +, \cdot)$ is the solution space of a homogeneous
system of linear equations $\boldsymbol{A}\boldsymbol{x} = \boldsymbol{0}$ for $\boldsymbol{x} \in \mathbb{R}^n$. ♢
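
The difference between the homogeneous and the inhomogeneous case can be checked on a toy system. The sketch below assumes a single equation $x_1 + 2x_2 - x_3 = b$; the matrix and the particular solutions are illustrative choices, not taken from the text, and `apply` is a hypothetical helper name.

```python
def apply(A, x):
    """Compute the matrix-vector product A x for A given as a list of rows."""
    return [sum(a * xi for a, xi in zip(row, x)) for row in A]

A = [[1, 2, -1]]  # one equation in three unknowns (illustrative)

# Two solutions of the homogeneous system A x = 0.
u, v = [1, 0, 1], [2, -1, 0]
combo = [3 * ui - 2 * vi for ui, vi in zip(u, v)]  # 3u - 2v
homogeneous_closed = apply(A, u) == [0] and apply(A, v) == [0] and apply(A, combo) == [0]

# Two solutions of the inhomogeneous system A x = 1.
p, q = [1, 0, 0], [0, 0, -1]
p_plus_q = [pi + qi for pi, qi in zip(p, q)]
inhomogeneous_closed = apply(A, p_plus_q) == [1]

print(homogeneous_closed)    # True: linear combinations of solutions stay solutions
print(inhomogeneous_closed)  # False: p + q solves A x = 2, not A x = 1
```

A single failing pair is enough to rule out a subspace; a passing check on a few vectors is of course only an illustration, not a proof of closure.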


2.5 Linear Independence 


In the following, we will have a close look at what we can do with vectors 
(elements of the vector space). In particular, we can add vectors together 
and multiply them with scalars. The closure property guarantees that we 
end up with another vector in the same vector space. It is possible to find 
a set of vectors with which we can represent every vector in the vector 
space by adding them together and scaling them. Such a set of vectors is
a basis, and we will discuss bases in Section 2.6.1. Before we get there,
we will need to introduce the concepts of linear combinations and linear 
independence. 


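Definition 2.11 (Linear Combination). Consider a vector space $V$ and a
finite number of vectors $\boldsymbol{x}_1, \ldots, \boldsymbol{x}_k \in V$. Then, every $\boldsymbol{v} \in V$ of the form

$$\boldsymbol{v} = \lambda_1 \boldsymbol{x}_1 + \cdots + \lambda_k \boldsymbol{x}_k = \sum_{i=1}^{k} \lambda_i \boldsymbol{x}_i \in V \tag{2.65}$$

with $\lambda_1, \ldots, \lambda_k \in \mathbb{R}$ is a linear combination of the vectors $\boldsymbol{x}_1, \ldots, \boldsymbol{x}_k$.


The $\boldsymbol{0}$-vector can always be written as the linear combination of $k$ vectors
$\boldsymbol{x}_1, \ldots, \boldsymbol{x}_k$ because $\boldsymbol{0} = \sum_{i=1}^{k} 0 \boldsymbol{x}_i$ is always true. In the following,
we are interested in non-trivial linear combinations of a set of vectors to
represent $\boldsymbol{0}$, i.e., linear combinations of vectors $\boldsymbol{x}_1, \ldots, \boldsymbol{x}_k$, where not all
coefficients $\lambda_i$ in (2.65) are 0.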


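Definition 2.12 (Linear (In)dependence). Let us consider a vector space
$V$ with $k \in \mathbb{N}$ and $\boldsymbol{x}_1, \ldots, \boldsymbol{x}_k \in V$. If there is a non-trivial linear
combination, such that $\boldsymbol{0} = \sum_{i=1}^{k} \lambda_i \boldsymbol{x}_i$ with at least one $\lambda_i \neq 0$, the vectors
$\boldsymbol{x}_1, \ldots, \boldsymbol{x}_k$ are linearly dependent. If only the trivial solution exists, i.e.,
$\lambda_1 = \ldots = \lambda_k = 0$, the vectors $\boldsymbol{x}_1, \ldots, \boldsymbol{x}_k$ are linearly independent.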


Linear independence is one of the most important concepts in linear 
algebra. Intuitively, a set of linearly independent vectors consists of vectors 
that have no redundancy, i.e., if we remove any of those vectors from 
the set, we will lose something. Throughout the next sections, we will 
formalize this intuition more. 


Example 2.13 (Linearly Dependent Vectors) 

A geographic example may help to clarify the concept of linear independence.
A person in Nairobi (Kenya) describing where Kigali (Rwanda) is
might say, "You can get to Kigali by first going 506 km Northwest to Kampala
(Uganda) and then 374 km Southwest." This is sufficient information
to describe the location of Kigali because the geographic coordinate system
may be considered a two-dimensional vector space (ignoring altitude
and the Earth's curved surface). The person may add, "It is about 751 km
West of here." Although this last statement is true, it is not necessary to
find Kigali given the previous information (see Figure 2.7 for an illustration).
In this example, the "506 km Northwest" vector (blue) and the
"374 km Southwest" vector (purple) are linearly independent. This means
the Southwest vector cannot be described in terms of the Northwest vector,
and vice versa. However, the third "751 km West" vector (black) is a
linear combination of the other two vectors, and it makes the set of vectors
linearly dependent. Equivalently, "751 km West" and "374 km
Southwest" can be linearly combined to obtain "506 km Northwest".







Remark. The following properties are useful to find out whether vectors 
are linearly independent: 


- $k$ vectors are either linearly dependent or linearly independent. There
is no third option.

- If at least one of the vectors $\boldsymbol{x}_1, \ldots, \boldsymbol{x}_k$ is $\boldsymbol{0}$ then they are linearly dependent.
The same holds if two vectors are identical.

- The vectors $\{\boldsymbol{x}_1, \ldots, \boldsymbol{x}_k : \boldsymbol{x}_i \neq \boldsymbol{0}, i = 1, \ldots, k\}$, $k \geq 2$, are linearly
dependent if and only if (at least) one of them is a linear combination
of the others. In particular, if one vector is a multiple of another vector,
i.e., $\boldsymbol{x}_i = \lambda \boldsymbol{x}_j$, $\lambda \in \mathbb{R}$, then the set $\{\boldsymbol{x}_1, \ldots, \boldsymbol{x}_k : \boldsymbol{x}_i \neq \boldsymbol{0}, i = 1, \ldots, k\}$
is linearly dependent.

- A practical way of checking whether vectors $\boldsymbol{x}_1, \ldots, \boldsymbol{x}_k \in V$ are linearly
independent is to use Gaussian elimination: Write all vectors as columns
of a matrix $\boldsymbol{A}$ and perform Gaussian elimination until the matrix is in
row echelon form (the reduced row-echelon form is unnecessary here):

  - The pivot columns indicate the vectors, which are linearly independent
    of the vectors on the left. Note that there is an ordering of vectors
    when the matrix is built.

  - The non-pivot columns can be expressed as linear combinations of
    the pivot columns on their left. For instance, the row-echelon form

    $$\begin{bmatrix} 1 & 3 & 0 \\ 0 & 0 & 2 \end{bmatrix} \tag{2.66}$$

    tells us that the first and third columns are pivot columns. The second
    column is a non-pivot column because it is three times the first
    column.

  All column vectors are linearly independent if and only if all columns
  are pivot columns. If there is at least one non-pivot column, the columns
  (and, therefore, the corresponding vectors) are linearly dependent.

♢
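Example 2.14
Consider $\mathbb{R}^4$ with

$$\boldsymbol{x}_1 = \begin{bmatrix} 1 \\ 2 \\ -3 \\ 4 \end{bmatrix}, \quad \boldsymbol{x}_2 = \begin{bmatrix} 1 \\ 1 \\ 0 \\ 2 \end{bmatrix}, \quad \boldsymbol{x}_3 = \begin{bmatrix} -1 \\ -2 \\ 1 \\ 1 \end{bmatrix}. \tag{2.67}$$

To check whether they are linearly dependent, we follow the general approach
and solve

$$\lambda_1 \boldsymbol{x}_1 + \lambda_2 \boldsymbol{x}_2 + \lambda_3 \boldsymbol{x}_3 = \lambda_1 \begin{bmatrix} 1 \\ 2 \\ -3 \\ 4 \end{bmatrix} + \lambda_2 \begin{bmatrix} 1 \\ 1 \\ 0 \\ 2 \end{bmatrix} + \lambda_3 \begin{bmatrix} -1 \\ -2 \\ 1 \\ 1 \end{bmatrix} = \boldsymbol{0} \tag{2.68}$$

for $\lambda_1, \ldots, \lambda_3$. We write the vectors $\boldsymbol{x}_i$, $i = 1, 2, 3$, as the columns of a
matrix and apply elementary row operations until we identify the pivot
columns:

$$\begin{bmatrix} 1 & 1 & -1 \\ 2 & 1 & -2 \\ -3 & 0 & 1 \\ 4 & 2 & 1 \end{bmatrix} \rightsquigarrow \cdots \rightsquigarrow \begin{bmatrix} 1 & 1 & -1 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \\ 0 & 0 & 0 \end{bmatrix}. \tag{2.69}$$

Here, every column of the matrix is a pivot column. Therefore, there is no
non-trivial solution, and we require $\lambda_1 = 0, \lambda_2 = 0, \lambda_3 = 0$ to solve the
equation system. Hence, the vectors $\boldsymbol{x}_1, \boldsymbol{x}_2, \boldsymbol{x}_3$ are linearly independent.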
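
The elimination-based test can be sketched in a few lines of Python. The function `rank` below is a hypothetical helper written for this illustration; it uses exact rational arithmetic (`fractions.Fraction`) so that no floating-point tolerance is needed, and it counts the pivots found during forward elimination. The vectors are those of Example 2.14.

```python
from fractions import Fraction

def rank(rows):
    """Count the pivots found by forward (Gaussian) elimination."""
    A = [[Fraction(v) for v in row] for row in rows]
    r = 0  # index of the next pivot row
    for c in range(len(A[0])):
        # Look for a row at or below r with a nonzero entry in column c.
        p = next((i for i in range(r, len(A)) if A[i][c] != 0), None)
        if p is None:
            continue  # column c is a non-pivot column
        A[r], A[p] = A[p], A[r]
        for i in range(r + 1, len(A)):
            f = A[i][c] / A[r][c]
            A[i] = [a - f * b for a, b in zip(A[i], A[r])]
        r += 1
        if r == len(A):
            break
    return r

x1, x2, x3 = [1, 2, -3, 4], [1, 1, 0, 2], [-1, -2, 1, 1]
columns = [list(row) for row in zip(x1, x2, x3)]  # vectors as matrix columns

# The vectors are independent exactly when every column is a pivot column.
independent = rank(columns) == 3
print(independent)  # True
```

For the row-echelon form in (2.66), the same function finds only two pivots for three columns, which signals linear dependence.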
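Remark. Consider a vector space $V$ with $k$ linearly independent vectors
$\boldsymbol{b}_1, \ldots, \boldsymbol{b}_k$ and $m$ linear combinations

$$\begin{aligned} \boldsymbol{x}_1 &= \sum_{i=1}^{k} \lambda_{i1} \boldsymbol{b}_i, \\ &\;\;\vdots \\ \boldsymbol{x}_m &= \sum_{i=1}^{k} \lambda_{im} \boldsymbol{b}_i. \end{aligned} \tag{2.70}$$

Defining $\boldsymbol{B} = [\boldsymbol{b}_1, \ldots, \boldsymbol{b}_k]$ as the matrix whose columns are the linearly
independent vectors $\boldsymbol{b}_1, \ldots, \boldsymbol{b}_k$, we can write

$$\boldsymbol{x}_j = \boldsymbol{B} \boldsymbol{\lambda}_j, \quad \boldsymbol{\lambda}_j = \begin{bmatrix} \lambda_{1j} \\ \vdots \\ \lambda_{kj} \end{bmatrix}, \quad j = 1, \ldots, m, \tag{2.71}$$

in a more compact form.

We want to test whether $\boldsymbol{x}_1, \ldots, \boldsymbol{x}_m$ are linearly independent. For this
purpose, we follow the general approach of testing when $\sum_{j=1}^{m} \psi_j \boldsymbol{x}_j = \boldsymbol{0}$.
With (2.71), we obtain

$$\sum_{j=1}^{m} \psi_j \boldsymbol{x}_j = \sum_{j=1}^{m} \psi_j \boldsymbol{B} \boldsymbol{\lambda}_j = \boldsymbol{B} \sum_{j=1}^{m} \psi_j \boldsymbol{\lambda}_j. \tag{2.72}$$

This means that $\{\boldsymbol{x}_1, \ldots, \boldsymbol{x}_m\}$ are linearly independent if and only if the
column vectors $\{\boldsymbol{\lambda}_1, \ldots, \boldsymbol{\lambda}_m\}$ are linearly independent.
♢
Remark. In a vector space $V$, $m$ linear combinations of $k$ vectors $\boldsymbol{x}_1, \ldots, \boldsymbol{x}_k$
are linearly dependent if $m > k$. ♢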
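Example 2.15
Consider a set of linearly independent vectors $\boldsymbol{b}_1, \boldsymbol{b}_2, \boldsymbol{b}_3, \boldsymbol{b}_4 \in \mathbb{R}^n$ and

$$\begin{aligned} \boldsymbol{x}_1 &= \boldsymbol{b}_1 - 2\boldsymbol{b}_2 + \boldsymbol{b}_3 - \boldsymbol{b}_4 \\ \boldsymbol{x}_2 &= -4\boldsymbol{b}_1 - 2\boldsymbol{b}_2 + 4\boldsymbol{b}_4 \\ \boldsymbol{x}_3 &= 2\boldsymbol{b}_1 + 3\boldsymbol{b}_2 - \boldsymbol{b}_3 - 3\boldsymbol{b}_4 \\ \boldsymbol{x}_4 &= 17\boldsymbol{b}_1 - 10\boldsymbol{b}_2 + 11\boldsymbol{b}_3 + \boldsymbol{b}_4 \end{aligned} \tag{2.73}$$

Are the vectors $\boldsymbol{x}_1, \ldots, \boldsymbol{x}_4 \in \mathbb{R}^n$ linearly independent? To answer this
question, we investigate whether the column vectors

$$\left\{ \begin{bmatrix} 1 \\ -2 \\ 1 \\ -1 \end{bmatrix}, \begin{bmatrix} -4 \\ -2 \\ 0 \\ 4 \end{bmatrix}, \begin{bmatrix} 2 \\ 3 \\ -1 \\ -3 \end{bmatrix}, \begin{bmatrix} 17 \\ -10 \\ 11 \\ 1 \end{bmatrix} \right\} \tag{2.74}$$

are linearly independent. The reduced row-echelon form of the corresponding
linear equation system with coefficient matrix

$$\boldsymbol{A} = \begin{bmatrix} 1 & -4 & 2 & 17 \\ -2 & -2 & 3 & -10 \\ 1 & 0 & -1 & 11 \\ -1 & 4 & -3 & 1 \end{bmatrix} \tag{2.75}$$

is given as

$$\begin{bmatrix} 1 & 0 & 0 & -7 \\ 0 & 1 & 0 & -15 \\ 0 & 0 & 1 & -18 \\ 0 & 0 & 0 & 0 \end{bmatrix}. \tag{2.76}$$

We see that the corresponding linear equation system is non-trivially solvable:
The last column is not a pivot column, and $\boldsymbol{x}_4 = -7\boldsymbol{x}_1 - 15\boldsymbol{x}_2 - 18\boldsymbol{x}_3$.
Therefore, $\boldsymbol{x}_1, \ldots, \boldsymbol{x}_4$ are linearly dependent as $\boldsymbol{x}_4$ can be expressed as a
linear combination of $\boldsymbol{x}_1, \ldots, \boldsymbol{x}_3$.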
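
The dependence found in Example 2.15 can be verified directly on the coefficient vectors in (2.74), since by (2.72) it suffices to work with these columns instead of the $\boldsymbol{x}_j$ themselves. The variable names in this short sketch are illustrative.

```python
# Coefficient vectors of x1, ..., x4 with respect to b1, ..., b4, see (2.74).
lam = [
    [1, -2, 1, -1],
    [-4, -2, 0, 4],
    [2, 3, -1, -3],
    [17, -10, 11, 1],
]

# Coefficients read off the last column of the reduced row-echelon form (2.76).
coeffs = [-7, -15, -18]

# Form -7*lam1 - 15*lam2 - 18*lam3 entry by entry.
combo = [sum(c * v[i] for c, v in zip(coeffs, lam[:3])) for i in range(4)]

print(combo)            # [17, -10, 11, 1]
print(combo == lam[3])  # True: x4 is a linear combination of x1, x2, x3
```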


2.6 Basis and Rank 


In a vector space V, we are particularly interested in sets of vectors A that 
possess the property that any vector v € V can be obtained by a linear 
combination of vectors in A. These vectors are special vectors, and in the 
following, we will characterize them. 


2.6.1 Generating Set and Basis 


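Definition 2.13 (Generating Set and Span). Consider a vector space $V = (\mathcal{V}, +, \cdot)$
and set of vectors $\mathcal{A} = \{\boldsymbol{x}_1, \ldots, \boldsymbol{x}_k\} \subseteq \mathcal{V}$. If every vector $\boldsymbol{v} \in \mathcal{V}$
can be expressed as a linear combination of $\boldsymbol{x}_1, \ldots, \boldsymbol{x}_k$, $\mathcal{A}$ is called a
generating set of $V$. The set of all linear combinations of vectors in $\mathcal{A}$ is
called the span of $\mathcal{A}$. If $\mathcal{A}$ spans the vector space $V$, we write $V = \operatorname{span}[\mathcal{A}]$
or $V = \operatorname{span}[\boldsymbol{x}_1, \ldots, \boldsymbol{x}_k]$.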


Generating sets are sets of vectors that span vector (sub)spaces, i.e., 
every vector can be represented as a linear combination of the vectors 
in the generating set. Now, we will be more specific and characterize the 
smallest generating set that spans a vector (sub)space. 


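Definition 2.14 (Basis). Consider a vector space $V = (\mathcal{V}, +, \cdot)$ and $\mathcal{A} \subseteq \mathcal{V}$.
A generating set $\mathcal{A}$ of $V$ is called minimal if there exists no smaller set
$\tilde{\mathcal{A}} \subsetneq \mathcal{A} \subseteq \mathcal{V}$ that spans $V$. Every linearly independent generating set of $V$
is minimal and is called a basis of $V$.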
Let $V = (\mathcal{V}, +, \cdot)$ be a vector space and $\mathcal{B} \subseteq \mathcal{V}$, $\mathcal{B} \neq \emptyset$. Then, the
following statements are equivalent:

- $\mathcal{B}$ is a basis of $V$.

- $\mathcal{B}$ is a minimal generating set.

- $\mathcal{B}$ is a maximal linearly independent set of vectors in $V$, i.e., adding any
other vector to this set will make it linearly dependent.

- Every vector $\boldsymbol{x} \in V$ is a linear combination of vectors from $\mathcal{B}$, and every
linear combination is unique, i.e., with

$$\boldsymbol{x} = \sum_{i=1}^{k} \lambda_i \boldsymbol{b}_i = \sum_{i=1}^{k} \psi_i \boldsymbol{b}_i \tag{2.77}$$

and $\lambda_i, \psi_i \in \mathbb{R}$, $\boldsymbol{b}_i \in \mathcal{B}$ it follows that $\lambda_i = \psi_i$, $i = 1, \ldots, k$.
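Example 2.16

- In $\mathbb{R}^3$, the canonical/standard basis is

$$\mathcal{B} = \left\{ \begin{bmatrix} 1 \\ 0 \\ 0 \end{bmatrix}, \begin{bmatrix} 0 \\ 1 \\ 0 \end{bmatrix}, \begin{bmatrix} 0 \\ 0 \\ 1 \end{bmatrix} \right\}. \tag{2.78}$$

- Different bases in $\mathbb{R}^3$ are

$$\mathcal{B}_1 = \left\{ \begin{bmatrix} 1 \\ 0 \\ 0 \end{bmatrix}, \begin{bmatrix} 1 \\ 1 \\ 0 \end{bmatrix}, \begin{bmatrix} 1 \\ 1 \\ 1 \end{bmatrix} \right\}, \quad \mathcal{B}_2 = \left\{ \begin{bmatrix} 0.5 \\ 0.8 \\ 0.4 \end{bmatrix}, \begin{bmatrix} 1.8 \\ 0.3 \\ 0.3 \end{bmatrix}, \begin{bmatrix} -2.2 \\ -1.3 \\ 3.5 \end{bmatrix} \right\}. \tag{2.79}$$

- The set

$$\mathcal{A} = \left\{ \begin{bmatrix} 1 \\ 2 \\ 3 \\ 4 \end{bmatrix}, \begin{bmatrix} 2 \\ -1 \\ 0 \\ 2 \end{bmatrix}, \begin{bmatrix} 1 \\ 1 \\ 0 \\ -4 \end{bmatrix} \right\} \tag{2.80}$$

is linearly independent, but not a generating set (and no basis) of $\mathbb{R}^4$:
For instance, the vector $[1, 0, 0, 0]^\top$ cannot be obtained by a linear combination
of elements in $\mathcal{A}$.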


Remark. Every vector space $V$ possesses a basis $\mathcal{B}$. The preceding examples
show that there can be many bases of a vector space $V$, i.e., there is
no unique basis. However, all bases possess the same number of elements,
the basis vectors. ♢


We only consider finite-dimensional vector spaces $V$. In this case, the
dimension of $V$ is the number of basis vectors of $V$, and we write $\dim(V)$.
If $U \subseteq V$ is a subspace of $V$, then $\dim(U) \leq \dim(V)$ and $\dim(U) = \dim(V)$
if and only if $U = V$. Intuitively, the dimension of a vector space
can be thought of as the number of independent directions in this vector
space.


Remark. The dimension of a vector space is not necessarily the number
of elements in a vector. For instance, the vector space $V = \operatorname{span}\left[\begin{bmatrix} 0 \\ 1 \end{bmatrix}\right]$ is
one-dimensional, although the basis vector possesses two elements. ♢


Remark. A basis of a subspace $U = \operatorname{span}[\boldsymbol{x}_1, \ldots, \boldsymbol{x}_m] \subseteq \mathbb{R}^n$ can be found
by executing the following steps:

1. Write the spanning vectors as columns of a matrix $\boldsymbol{A}$.

2. Determine the row-echelon form of $\boldsymbol{A}$.

3. The spanning vectors associated with the pivot columns are a basis of
   $U$.

♢
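Example 2.17 (Determining a Basis)
For a vector subspace $U \subseteq \mathbb{R}^5$, spanned by the vectors

$$\boldsymbol{x}_1 = \begin{bmatrix} 1 \\ 2 \\ -1 \\ -1 \\ -1 \end{bmatrix}, \quad \boldsymbol{x}_2 = \begin{bmatrix} 2 \\ -1 \\ 1 \\ 2 \\ -2 \end{bmatrix}, \quad \boldsymbol{x}_3 = \begin{bmatrix} 3 \\ -4 \\ 3 \\ 5 \\ -3 \end{bmatrix}, \quad \boldsymbol{x}_4 = \begin{bmatrix} -1 \\ 8 \\ -5 \\ -6 \\ 1 \end{bmatrix} \in \mathbb{R}^5, \tag{2.81}$$

we are interested in finding out which vectors $\boldsymbol{x}_1, \ldots, \boldsymbol{x}_4$ are a basis for $U$.
For this, we need to check whether $\boldsymbol{x}_1, \ldots, \boldsymbol{x}_4$ are linearly independent.
Therefore, we need to solve

$$\sum_{i=1}^{4} \lambda_i \boldsymbol{x}_i = \boldsymbol{0}, \tag{2.82}$$

which leads to a homogeneous system of equations with matrix

$$[\boldsymbol{x}_1, \boldsymbol{x}_2, \boldsymbol{x}_3, \boldsymbol{x}_4] = \begin{bmatrix} 1 & 2 & 3 & -1 \\ 2 & -1 & -4 & 8 \\ -1 & 1 & 3 & -5 \\ -1 & 2 & 5 & -6 \\ -1 & -2 & -3 & 1 \end{bmatrix}. \tag{2.83}$$

With the basic transformation rules for systems of linear equations, we
obtain the row-echelon form

$$\begin{bmatrix} 1 & 2 & 3 & -1 \\ 2 & -1 & -4 & 8 \\ -1 & 1 & 3 & -5 \\ -1 & 2 & 5 & -6 \\ -1 & -2 & -3 & 1 \end{bmatrix} \rightsquigarrow \cdots \rightsquigarrow \begin{bmatrix} 1 & 2 & 3 & -1 \\ 0 & 1 & 2 & -2 \\ 0 & 0 & 0 & 1 \\ 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 \end{bmatrix}.$$

Since the pivot columns indicate which set of vectors is linearly independent,
we see from the row-echelon form that $\boldsymbol{x}_1, \boldsymbol{x}_2, \boldsymbol{x}_4$ are linearly independent
(because the system of linear equations $\lambda_1 \boldsymbol{x}_1 + \lambda_2 \boldsymbol{x}_2 + \lambda_4 \boldsymbol{x}_4 = \boldsymbol{0}$
can only be solved with $\lambda_1 = \lambda_2 = \lambda_4 = 0$). Therefore, $\{\boldsymbol{x}_1, \boldsymbol{x}_2, \boldsymbol{x}_4\}$ is a
basis of $U$.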
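
The three-step recipe for finding a basis can be sketched as follows. The helper `pivot_columns` is a hypothetical name introduced for this illustration; it performs forward elimination with exact fractions and records in which columns a pivot appears. Applied to the matrix in (2.83), it reports columns 0, 1 and 3 (zero-based), i.e., $\boldsymbol{x}_1, \boldsymbol{x}_2, \boldsymbol{x}_4$.

```python
from fractions import Fraction

def pivot_columns(rows):
    """Return the zero-based indices of the pivot columns of a matrix."""
    A = [[Fraction(v) for v in row] for row in rows]
    pivots = []
    r = 0  # index of the next pivot row
    for c in range(len(A[0])):
        # Look for a row at or below r with a nonzero entry in column c.
        p = next((i for i in range(r, len(A)) if A[i][c] != 0), None)
        if p is None:
            continue  # no pivot in this column
        A[r], A[p] = A[p], A[r]
        for i in range(r + 1, len(A)):
            f = A[i][c] / A[r][c]
            A[i] = [a - f * b for a, b in zip(A[i], A[r])]
        pivots.append(c)
        r += 1
        if r == len(A):
            break
    return pivots

# Step 1: spanning vectors x1, ..., x4 from (2.81) as columns, see (2.83).
A = [
    [1, 2, 3, -1],
    [2, -1, -4, 8],
    [-1, 1, 3, -5],
    [-1, 2, 5, -6],
    [-1, -2, -3, 1],
]

# Steps 2 and 3: eliminate and read off the pivot columns.
basis_indices = pivot_columns(A)
print(basis_indices)  # [0, 1, 3]
```

The result depends on the order in which the vectors are written into the matrix: a different ordering may select a different, equally valid basis of $U$.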


2.6.2 Rank 


The number of linearly independent columns of a matrix $\boldsymbol{A} \in \mathbb{R}^{m \times n}$
equals the number of linearly independent rows and is called the rank
of $\boldsymbol{A}$ and is denoted by $\operatorname{rk}(\boldsymbol{A})$.


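Remark. The rank of a matrix has some important properties:

- $\operatorname{rk}(\boldsymbol{A}) = \operatorname{rk}(\boldsymbol{A}^\top)$, i.e., the column rank equals the row rank.

- The columns of $\boldsymbol{A} \in \mathbb{R}^{m \times n}$ span a subspace $U \subseteq \mathbb{R}^m$ with $\dim(U) = \operatorname{rk}(\boldsymbol{A})$.
Later we will call this subspace the image or range. A basis of
$U$ can be found by applying Gaussian elimination to $\boldsymbol{A}$ to identify the
pivot columns.

- The rows of $\boldsymbol{A} \in \mathbb{R}^{m \times n}$ span a subspace $W \subseteq \mathbb{R}^n$ with $\dim(W) = \operatorname{rk}(\boldsymbol{A})$.
A basis of $W$ can be found by applying Gaussian elimination to
$\boldsymbol{A}^\top$.

- For all $\boldsymbol{A} \in \mathbb{R}^{n \times n}$ it holds that $\boldsymbol{A}$ is regular (invertible) if and only if
$\operatorname{rk}(\boldsymbol{A}) = n$.

- For all $\boldsymbol{A} \in \mathbb{R}^{m \times n}$ and all $\boldsymbol{b} \in \mathbb{R}^m$ it holds that the linear equation
system $\boldsymbol{A}\boldsymbol{x} = \boldsymbol{b}$ can be solved if and only if $\operatorname{rk}(\boldsymbol{A}) = \operatorname{rk}(\boldsymbol{A}|\boldsymbol{b})$, where
$\boldsymbol{A}|\boldsymbol{b}$ denotes the augmented system.

- For $\boldsymbol{A} \in \mathbb{R}^{m \times n}$ the subspace of solutions for $\boldsymbol{A}\boldsymbol{x} = \boldsymbol{0}$ possesses dimension
$n - \operatorname{rk}(\boldsymbol{A})$. Later, we will call this subspace the kernel or the null
space.

- A matrix $\boldsymbol{A} \in \mathbb{R}^{m \times n}$ has full rank if its rank equals the largest possible
rank for a matrix of the same dimensions. This means that the rank of
a full-rank matrix is the lesser of the number of rows and columns, i.e.,
$\operatorname{rk}(\boldsymbol{A}) = \min(m, n)$. A matrix is said to be rank deficient if it does not
have full rank.

♢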


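Example 2.18 (Rank)

- $\boldsymbol{A} = \begin{bmatrix} 1 & 0 & 1 \\ 0 & 1 & 1 \\ 0 & 0 & 0 \end{bmatrix}$.
$\boldsymbol{A}$ has two linearly independent rows/columns so that $\operatorname{rk}(\boldsymbol{A}) = 2$.

- $\boldsymbol{A} = \begin{bmatrix} 1 & 2 & 1 \\ -2 & -3 & 1 \\ 3 & 5 & 0 \end{bmatrix}$.
We use Gaussian elimination to determine the rank:

$$\begin{bmatrix} 1 & 2 & 1 \\ -2 & -3 & 1 \\ 3 & 5 & 0 \end{bmatrix} \rightsquigarrow \cdots \rightsquigarrow \begin{bmatrix} 1 & 2 & 1 \\ 0 & 1 & 3 \\ 0 & 0 & 0 \end{bmatrix}. \tag{2.84}$$

Here, we see that the number of linearly independent rows and columns
is 2, such that $\operatorname{rk}(\boldsymbol{A}) = 2$.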


2.7 Linear Mappings 


In the following, we will study mappings on vector spaces that preserve 
their structure, which will allow us to define the concept of a coordinate. 
In the beginning of the chapter, we said that vectors are objects that can be 
added together and multiplied by a scalar, and the resulting object is still 
a vector. We wish to preserve this property when applying the mapping: 
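Consider two real vector spaces $V, W$. A mapping $\Phi : V \to W$ preserves
the structure of the vector space if

$$\Phi(\boldsymbol{x} + \boldsymbol{y}) = \Phi(\boldsymbol{x}) + \Phi(\boldsymbol{y}) \tag{2.85}$$
$$\Phi(\lambda \boldsymbol{x}) = \lambda \Phi(\boldsymbol{x}) \tag{2.86}$$

for all $\boldsymbol{x}, \boldsymbol{y} \in V$ and $\lambda \in \mathbb{R}$. We can summarize this in the following
definition: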


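Definition 2.15 (Linear Mapping). For vector spaces $V, W$, a mapping
$\Phi : V \to W$ is called a linear mapping (or vector space homomorphism/
linear transformation) if

$$\forall \boldsymbol{x}, \boldsymbol{y} \in V\ \forall \lambda, \psi \in \mathbb{R} : \Phi(\lambda \boldsymbol{x} + \psi \boldsymbol{y}) = \lambda \Phi(\boldsymbol{x}) + \psi \Phi(\boldsymbol{y}). \tag{2.87}$$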


It turns out that we can represent linear mappings as matrices (Sec- 
tion 2.7.1). Recall that we can also collect a set of vectors as columns of a 
matrix. When working with matrices, we have to keep in mind what the 
matrix represents: a linear mapping or a collection of vectors. We will see 
more about linear mappings in Chapter 4. Before we continue, we will 
briefly introduce special mappings. 


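Definition 2.16 (Injective, Surjective, Bijective). Consider a mapping
$\Phi : \mathcal{V} \to \mathcal{W}$, where $\mathcal{V}, \mathcal{W}$ can be arbitrary sets. Then $\Phi$ is called

- Injective if $\forall \boldsymbol{x}, \boldsymbol{y} \in \mathcal{V} : \Phi(\boldsymbol{x}) = \Phi(\boldsymbol{y}) \implies \boldsymbol{x} = \boldsymbol{y}$.
- Surjective if $\Phi(\mathcal{V}) = \mathcal{W}$.
- Bijective if it is injective and surjective.

If $\Phi$ is surjective, then every element in $\mathcal{W}$ can be "reached" from $\mathcal{V}$
using $\Phi$. A bijective $\Phi$ can be "undone", i.e., there exists a mapping
$\Psi : \mathcal{W} \to \mathcal{V}$ so that $\Psi \circ \Phi(\boldsymbol{x}) = \boldsymbol{x}$. This mapping $\Psi$ is then called the inverse
of $\Phi$ and normally denoted by $\Phi^{-1}$.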

With these definitions, we introduce the following special cases of linear 
mappings between vector spaces V and W: 


- Isomorphism: $\Phi : V \to W$ linear and bijective

- Endomorphism: $\Phi : V \to V$ linear

- Automorphism: $\Phi : V \to V$ linear and bijective

- We define $\mathrm{id}_V : V \to V$, $\boldsymbol{x} \mapsto \boldsymbol{x}$ as the identity mapping or identity
automorphism in $V$.


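Example 2.19 (Homomorphism)
The mapping $\Phi : \mathbb{R}^2 \to \mathbb{C}$, $\Phi(\boldsymbol{x}) = x_1 + i x_2$, is a homomorphism:

$$\begin{aligned} \Phi\left( \begin{bmatrix} x_1 \\ x_2 \end{bmatrix} + \begin{bmatrix} y_1 \\ y_2 \end{bmatrix} \right) &= (x_1 + y_1) + i(x_2 + y_2) = x_1 + i x_2 + y_1 + i y_2 \\ &= \Phi\left( \begin{bmatrix} x_1 \\ x_2 \end{bmatrix} \right) + \Phi\left( \begin{bmatrix} y_1 \\ y_2 \end{bmatrix} \right) \\ \Phi\left( \lambda \begin{bmatrix} x_1 \\ x_2 \end{bmatrix} \right) &= \lambda x_1 + \lambda i x_2 = \lambda (x_1 + i x_2) = \lambda \Phi\left( \begin{bmatrix} x_1 \\ x_2 \end{bmatrix} \right). \end{aligned} \tag{2.88}$$

This also justifies why complex numbers can be represented as tuples in
$\mathbb{R}^2$: There is a bijective linear mapping that converts the elementwise addition
of tuples in $\mathbb{R}^2$ into the set of complex numbers with the corresponding
addition. Note that we only showed linearity, but not the bijection.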
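
The two conditions in (2.88) can be spot-checked with Python's built-in complex numbers. The function name `phi` and the sample values are illustrative assumptions; a numerical check on a few inputs illustrates the identities but does not replace the derivation above.

```python
def phi(x):
    """Map a tuple (x1, x2) in R^2 to the complex number x1 + i*x2."""
    return complex(x[0], x[1])

# Sample values chosen for illustration (exactly representable floats).
x, y, lam = (1.0, 2.0), (3.0, -1.0), 2.5

# Additivity: phi(x + y) == phi(x) + phi(y)
additive = phi((x[0] + y[0], x[1] + y[1])) == phi(x) + phi(y)

# Homogeneity: phi(lam * x) == lam * phi(x)
homogeneous = phi((lam * x[0], lam * x[1])) == lam * phi(x)

print(additive, homogeneous)  # True True
```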


Theorem 2.17 (Theorem 3.59 in Axler (2015)). Finite-dimensional vector 
spaces V and W are isomorphic if and only if dim(V) = dim(W). 


Theorem 2.17 states that there exists a linear, bijective mapping be- 
tween two vector spaces of the same dimension. Intuitively, this means 
that vector spaces of the same dimension are kind of the same thing, as 
they can be transformed into each other without incurring any loss. 

Theorem 2.17 also gives us the justification to treat $\mathbb{R}^{m \times n}$ (the vector
space of $m \times n$-matrices) and $\mathbb{R}^{mn}$ (the vector space of vectors of length
$mn$) the same, as their dimensions are $mn$, and there exists a linear, bijective
mapping that transforms one into the other.


Remark. Consider vector spaces $V, W, X$. Then:

- For linear mappings $\Phi : V \to W$ and $\Psi : W \to X$, the mapping
$\Psi \circ \Phi : V \to X$ is also linear.

- If $\Phi : V \to W$ is an isomorphism, then $\Phi^{-1} : W \to V$ is an isomorphism,
too.

- If $\Phi : V \to W$, $\Psi : V \to W$ are linear, then $\Phi + \Psi$ and $\lambda \Phi$, $\lambda \in \mathbb{R}$, are
linear, too.

♢


2.7.1 Matrix Representation of Linear Mappings 


Any $n$-dimensional vector space is isomorphic to $\mathbb{R}^n$ (Theorem 2.17). We
consider a basis $\{\boldsymbol{b}_1, \ldots, \boldsymbol{b}_n\}$ of an $n$-dimensional vector space $V$. In the
following, the order of the basis vectors will be important. Therefore, we
write

$$B = (\boldsymbol{b}_1, \ldots, \boldsymbol{b}_n) \tag{2.89}$$

and call this $n$-tuple an ordered basis of $V$.


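Remark (Notation). We are at the point where notation gets a bit tricky.
Therefore, we summarize some parts here. $B = (\boldsymbol{b}_1, \ldots, \boldsymbol{b}_n)$ is an ordered
basis, $\mathcal{B} = \{\boldsymbol{b}_1, \ldots, \boldsymbol{b}_n\}$ is an (unordered) basis, and $\boldsymbol{B} = [\boldsymbol{b}_1, \ldots, \boldsymbol{b}_n]$ is a
matrix whose columns are the vectors $\boldsymbol{b}_1, \ldots, \boldsymbol{b}_n$. ♢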


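Definition 2.18 (Coordinates). Consider a vector space $V$ and an ordered
basis $B = (\boldsymbol{b}_1, \ldots, \boldsymbol{b}_n)$ of $V$. For any $\boldsymbol{x} \in V$ we obtain a unique representation
(linear combination)

$$\boldsymbol{x} = \alpha_1 \boldsymbol{b}_1 + \ldots + \alpha_n \boldsymbol{b}_n \tag{2.90}$$

of $\boldsymbol{x}$ with respect to $B$. Then $\alpha_1, \ldots, \alpha_n$ are the coordinates of $\boldsymbol{x}$ with
respect to $B$, and the vector

$$\boldsymbol{\alpha} = \begin{bmatrix} \alpha_1 \\ \vdots \\ \alpha_n \end{bmatrix} \in \mathbb{R}^n \tag{2.91}$$

is the coordinate vector/coordinate representation of $\boldsymbol{x}$ with respect to the
ordered basis $B$.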
A basis effectively defines a coordinate system. We are familiar with the
Cartesian coordinate system in two dimensions, which is spanned by the
canonical basis vectors $\boldsymbol{e}_1, \boldsymbol{e}_2$. In this coordinate system, a vector $\boldsymbol{x} \in \mathbb{R}^2$
has a representation that tells us how to linearly combine $\boldsymbol{e}_1$ and $\boldsymbol{e}_2$ to
obtain $\boldsymbol{x}$. However, any basis of $\mathbb{R}^2$ defines a valid coordinate system,
and the same vector $\boldsymbol{x}$ from before may have a different coordinate representation
in the $(\boldsymbol{b}_1, \boldsymbol{b}_2)$ basis. In Figure 2.8, the coordinates of $\boldsymbol{x}$ with
respect to the standard basis $(\boldsymbol{e}_1, \boldsymbol{e}_2)$ are $[2, 2]^\top$. However, with respect to
the basis $(\boldsymbol{b}_1, \boldsymbol{b}_2)$ the same vector $\boldsymbol{x}$ is represented as $[1.09, 0.72]^\top$, i.e.,
$\boldsymbol{x} = 1.09\boldsymbol{b}_1 + 0.72\boldsymbol{b}_2$. In the following sections, we will discover how to
obtain this representation.


Example 2.20

Let us have a look at a geometric vector $\boldsymbol{x} \in \mathbb{R}^2$ with coordinates $[2, 3]^\top$
with respect to the standard basis $(\boldsymbol{e}_1, \boldsymbol{e}_2)$ of $\mathbb{R}^2$. This means, we can write
$\boldsymbol{x} = 2\boldsymbol{e}_1 + 3\boldsymbol{e}_2$. However, we do not have to choose the standard basis to
represent this vector. If we use the basis vectors $\boldsymbol{b}_1 = [1, -1]^\top$, $\boldsymbol{b}_2 = [1, 1]^\top$
we will obtain the coordinates $\frac{1}{2}[-1, 5]^\top$ to represent the same vector with
respect to $(\boldsymbol{b}_1, \boldsymbol{b}_2)$ (see Figure 2.9).
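
The coordinates in Example 2.20 solve the $2 \times 2$ linear system $\alpha_1 \boldsymbol{b}_1 + \alpha_2 \boldsymbol{b}_2 = \boldsymbol{x}$. As a minimal sketch, the hypothetical helper `coordinates_2d` below solves this system with Cramer's rule, which is one convenient option in two dimensions and not necessarily the method used later in the text.

```python
from fractions import Fraction

def coordinates_2d(b1, b2, x):
    """Solve alpha1*b1 + alpha2*b2 = x for (alpha1, alpha2) via Cramer's rule."""
    det = b1[0] * b2[1] - b2[0] * b1[1]
    if det == 0:
        raise ValueError("b1 and b2 are linearly dependent and form no basis")
    alpha1 = Fraction(x[0] * b2[1] - b2[0] * x[1], det)
    alpha2 = Fraction(b1[0] * x[1] - x[0] * b1[1], det)
    return alpha1, alpha2

b1, b2 = [1, -1], [1, 1]
x = [2, 3]  # coordinates with respect to the standard basis

alpha = coordinates_2d(b1, b2, x)
print(alpha)  # (Fraction(-1, 2), Fraction(5, 2))

# Reassemble x from the new coordinates as a consistency check.
x_again = [alpha[0] * b1[i] + alpha[1] * b2[i] for i in range(2)]
print(x_again == x)  # True
```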


Remark. For an $n$-dimensional vector space $V$ and an ordered basis $B$
of $V$, the mapping $\Phi : \mathbb{R}^n \to V$, $\Phi(\boldsymbol{e}_i) = \boldsymbol{b}_i$, $i = 1, \ldots, n$, is linear
(and because of Theorem 2.17 an isomorphism), where $(\boldsymbol{e}_1, \ldots, \boldsymbol{e}_n)$ is
the standard basis of $\mathbb{R}^n$.

♢


Now we are ready to make an explicit connection between matrices and 
linear mappings between finite-dimensional vector spaces. 


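Definition 2.19 (Transformation Matrix). Consider vector spaces $V, W$
with corresponding (ordered) bases $B = (\boldsymbol{b}_1, \ldots, \boldsymbol{b}_n)$ and $C = (\boldsymbol{c}_1, \ldots, \boldsymbol{c}_m)$.
Moreover, we consider a linear mapping $\Phi : V \to W$. For $j \in \{1, \ldots, n\}$,

$$\Phi(\boldsymbol{b}_j) = \alpha_{1j} \boldsymbol{c}_1 + \cdots + \alpha_{mj} \boldsymbol{c}_m = \sum_{i=1}^{m} \alpha_{ij} \boldsymbol{c}_i \tag{2.92}$$

is the unique representation of $\Phi(\boldsymbol{b}_j)$ with respect to $C$. Then, we call the
$m \times n$-matrix $\boldsymbol{A}_\Phi$, whose elements are given by

$$A_\Phi(i, j) = \alpha_{ij}, \tag{2.93}$$

the transformation matrix of $\Phi$ (with respect to the ordered bases $B$ of $V$
and $C$ of $W$).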


The coordinates of $\Phi(\boldsymbol{b}_j)$ with respect to the ordered basis $C$ of $W$
are the $j$-th column of $\boldsymbol{A}_\Phi$. Consider (finite-dimensional) vector spaces
$V, W$ with ordered bases $B, C$ and a linear mapping $\Phi : V \to W$ with
transformation matrix $\boldsymbol{A}_\Phi$. If $\hat{\boldsymbol{x}}$ is the coordinate vector of $\boldsymbol{x} \in V$ with
respect to $B$ and $\hat{\boldsymbol{y}}$ the coordinate vector of $\boldsymbol{y} = \Phi(\boldsymbol{x}) \in W$ with respect
to $C$, then

$$\hat{\boldsymbol{y}} = \boldsymbol{A}_\Phi \hat{\boldsymbol{x}}. \tag{2.94}$$

This means that the transformation matrix can be used to map coordinates
with respect to an ordered basis in $V$ to coordinates with respect to an
ordered basis in $W$.


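Example 2.21 (Transformation Matrix)

Consider a homomorphism $\Phi : V \to W$ and ordered bases
$B = (\boldsymbol{b}_1, \ldots, \boldsymbol{b}_3)$ of $V$ and $C = (\boldsymbol{c}_1, \ldots, \boldsymbol{c}_4)$ of $W$. With

$$\begin{aligned} \Phi(\boldsymbol{b}_1) &= \boldsymbol{c}_1 - \boldsymbol{c}_2 + 3\boldsymbol{c}_3 - \boldsymbol{c}_4 \\ \Phi(\boldsymbol{b}_2) &= 2\boldsymbol{c}_1 + \boldsymbol{c}_2 + 7\boldsymbol{c}_3 + 2\boldsymbol{c}_4 \\ \Phi(\boldsymbol{b}_3) &= 3\boldsymbol{c}_2 + \boldsymbol{c}_3 + 4\boldsymbol{c}_4 \end{aligned} \tag{2.95}$$

the transformation matrix $\boldsymbol{A}_\Phi$ with respect to $B$ and $C$ satisfies
$\Phi(\boldsymbol{b}_k) = \sum_{i=1}^{4} \alpha_{ik} \boldsymbol{c}_i$ for $k = 1, \ldots, 3$ and is given as

$$\boldsymbol{A}_\Phi = [\boldsymbol{\alpha}_1, \boldsymbol{\alpha}_2, \boldsymbol{\alpha}_3] = \begin{bmatrix} 1 & 2 & 0 \\ -1 & 1 & 3 \\ 3 & 7 & 1 \\ -1 & 2 & 4 \end{bmatrix}, \tag{2.96}$$

where the $\boldsymbol{\alpha}_j$, $j = 1, 2, 3$, are the coordinate vectors of $\Phi(\boldsymbol{b}_j)$ with respect
to $C$.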
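
Equation (2.94) can be tried out on the matrix from (2.96). In the sketch below, `apply` is a hypothetical helper and the coordinate vector `x_hat` is an illustrative choice corresponding to $\boldsymbol{x} = \boldsymbol{b}_1 + \boldsymbol{b}_2 + \boldsymbol{b}_3$.

```python
def apply(A, x):
    """Compute the matrix-vector product A x for A given as a list of rows."""
    return [sum(a * xi for a, xi in zip(row, x)) for row in A]

# Transformation matrix from (2.96): column j holds the coordinates of Phi(b_j)
# with respect to C.
A_phi = [
    [1, 2, 0],
    [-1, 1, 3],
    [3, 7, 1],
    [-1, 2, 4],
]

# Coordinates of b_1 with respect to B are e_1, so A_phi e_1 is the first column.
print(apply(A_phi, [1, 0, 0]))  # [1, -1, 3, -1]

# Coordinates of x = b_1 + b_2 + b_3 with respect to B (illustrative choice).
x_hat = [1, 1, 1]
y_hat = apply(A_phi, x_hat)
print(y_hat)  # [3, 3, 11, 5], i.e. Phi(x) = 3c_1 + 3c_2 + 11c_3 + 5c_4
```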


Example 2.22 (Linear Transformations of Vectors)

[Figure 2.10: (a) Original data. (b) Rotation by 45°. (c) Stretch along the horizontal axis. (d) General linear mapping.]

We consider three linear transformations of a set of vectors in $\mathbb{R}^2$ with
the transformation matrices

$$\boldsymbol{A}_1 = \begin{bmatrix} \cos(\frac{\pi}{4}) & -\sin(\frac{\pi}{4}) \\ \sin(\frac{\pi}{4}) & \cos(\frac{\pi}{4}) \end{bmatrix}, \quad \boldsymbol{A}_2 = \begin{bmatrix} 2 & 0 \\ 0 & 1 \end{bmatrix}, \quad \boldsymbol{A}_3 = \frac{1}{2} \begin{bmatrix} 3 & -1 \\ 1 & -1 \end{bmatrix}. \tag{2.97}$$

Figure 2.10 gives three examples of linear transformations of a set of vectors.
Figure 2.10(a) shows 400 vectors in $\mathbb{R}^2$, each of which is represented
by a dot at the corresponding $(x_1, x_2)$-coordinates. The vectors are arranged
in a square. When we use matrix $\boldsymbol{A}_1$ in (2.97) to linearly transform
each of these vectors, we obtain the rotated square in Figure 2.10(b). If we
apply the linear mapping represented by $\boldsymbol{A}_2$, we obtain the rectangle in
Figure 2.10(c) where each $x_1$-coordinate is stretched by 2. Figure 2.10(d)
shows the original square from Figure 2.10(a) when linearly transformed
using $\boldsymbol{A}_3$, which is a combination of a reflection, a rotation, and a stretch.


2.7.2 Basis Change 


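In the following, we will have a closer look at how transformation matrices
of a linear mapping $\Phi : V \to W$ change if we change the bases in $V$ and
$W$. Consider two ordered bases

$$B = (\boldsymbol{b}_1, \ldots, \boldsymbol{b}_n), \quad \tilde{B} = (\tilde{\boldsymbol{b}}_1, \ldots, \tilde{\boldsymbol{b}}_n) \tag{2.98}$$

of $V$ and two ordered bases

$$C = (\boldsymbol{c}_1, \ldots, \boldsymbol{c}_m), \quad \tilde{C} = (\tilde{\boldsymbol{c}}_1, \ldots, \tilde{\boldsymbol{c}}_m) \tag{2.99}$$

of $W$. Moreover, $\boldsymbol{A}_\Phi \in \mathbb{R}^{m \times n}$ is the transformation matrix of the linear
mapping $\Phi : V \to W$ with respect to the bases $B$ and $C$, and $\tilde{\boldsymbol{A}}_\Phi \in \mathbb{R}^{m \times n}$
is the corresponding transformation matrix with respect to $\tilde{B}$ and $\tilde{C}$.
In the following, we will investigate how $\boldsymbol{A}_\Phi$ and $\tilde{\boldsymbol{A}}_\Phi$ are related, i.e., how/
whether we can transform $\boldsymbol{A}_\Phi$ into $\tilde{\boldsymbol{A}}_\Phi$ if we choose to perform a basis
change from $B, C$ to $\tilde{B}, \tilde{C}$.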


Remark. We effectively get different coordinate representations of the
identity mapping $\mathrm{id}_V$. In the context of Figure 2.9, this would mean to
map coordinates with respect to $(\boldsymbol{e}_1, \boldsymbol{e}_2)$ onto coordinates with respect to
$(\boldsymbol{b}_1, \boldsymbol{b}_2)$ without changing the vector $\boldsymbol{x}$. By changing the basis and correspondingly
the representation of vectors, the transformation matrix with
respect to this new basis can have a particularly simple form that allows
for straightforward computation. ♢


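Example 2.23 (Basis Change)
Consider a transformation matrix

$$\boldsymbol{A} = \begin{bmatrix} 2 & 1 \\ 1 & 2 \end{bmatrix} \tag{2.100}$$

with respect to the canonical basis in $\mathbb{R}^2$. If we define a new basis

$$B = \left( \begin{bmatrix} 1 \\ 1 \end{bmatrix}, \begin{bmatrix} 1 \\ -1 \end{bmatrix} \right) \tag{2.101}$$

we obtain a diagonal transformation matrix

$$\tilde{\boldsymbol{A}} = \begin{bmatrix} 3 & 0 \\ 0 & 1 \end{bmatrix} \tag{2.102}$$

with respect to $B$, which is easier to work with than $\boldsymbol{A}$.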
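
The diagonal matrix in Example 2.23 can be reproduced numerically. The sketch below assumes the standard relation $\tilde{\boldsymbol{A}} = \boldsymbol{S}^{-1} \boldsymbol{A} \boldsymbol{S}$ for an endomorphism, where the columns of $\boldsymbol{S}$ are the new basis vectors written in the canonical basis; `matmul` and `inverse_2x2` are hypothetical helper names.

```python
from fractions import Fraction

def matmul(A, B):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def inverse_2x2(M):
    """Invert a regular 2x2 matrix using exact fractions."""
    (a, b), (c, d) = M
    det = Fraction(a * d - b * c)
    if det == 0:
        raise ValueError("matrix is singular")
    return [[d / det, -b / det], [-c / det, a / det]]

A = [[2, 1], [1, 2]]   # transformation matrix in the canonical basis, see (2.100)
S = [[1, 1], [1, -1]]  # columns are the new basis vectors from (2.101)

A_tilde = matmul(inverse_2x2(S), matmul(A, S))
print(A_tilde == [[3, 0], [0, 1]])  # True
```

The diagonal entries show that the mapping scales the first new basis vector by 3 and leaves the second one unchanged.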


In the following, we will look at mappings that transform coordinate 
vectors with respect to one basis into coordinate vectors with respect to 
a different basis. We will state our main result first and then provide an 
explanation. 


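Theorem 2.20 (Basis Change). For a linear mapping $\Phi : V \to W$, ordered
bases

$$B = (\boldsymbol{b}_1, \ldots, \boldsymbol{b}_n), \quad \tilde{B} = (\tilde{\boldsymbol{b}}_1, \ldots, \tilde{\boldsymbol{b}}_n) \tag{2.103}$$

of $V$ and

$$C = (\boldsymbol{c}_1, \ldots, \boldsymbol{c}_m), \quad \tilde{C} = (\tilde{\boldsymbol{c}}_1, \ldots, \tilde{\boldsymbol{c}}_m) \tag{2.104}$$

of $W$, and a transformation matrix $\boldsymbol{A}_\Phi$ of $\Phi$ with respect to $B$ and $C$, the
corresponding transformation matrix $\tilde{\boldsymbol{A}}_\Phi$ with respect to the bases $\tilde{B}$ and $\tilde{C}$
is given as

$$\tilde{\boldsymbol{A}}_\Phi = \boldsymbol{T}^{-1} \boldsymbol{A}_\Phi \boldsymbol{S}. \tag{2.105}$$

Here, $\boldsymbol{S} \in \mathbb{R}^{n \times n}$ is the transformation matrix of $\mathrm{id}_V$ that maps coordinates
with respect to $\tilde{B}$ onto coordinates with respect to $B$, and $\boldsymbol{T} \in \mathbb{R}^{m \times m}$ is the
transformation matrix of $\mathrm{id}_W$ that maps coordinates with respect to $\tilde{C}$ onto
coordinates with respect to $C$.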


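Proof Following Drumm and Weil (2001), we can write the vectors of
the new basis $\tilde{B}$ of $V$ as a linear combination of the basis vectors of $B$,
such that

$$\tilde{\boldsymbol{b}}_j = s_{1j} \boldsymbol{b}_1 + \cdots + s_{nj} \boldsymbol{b}_n = \sum_{i=1}^{n} s_{ij} \boldsymbol{b}_i, \quad j = 1, \ldots, n. \tag{2.106}$$

Similarly, we write the new basis vectors $\tilde{C}$ of $W$ as a linear combination
of the basis vectors of $C$, which yields

$$\tilde{\boldsymbol{c}}_k = t_{1k} \boldsymbol{c}_1 + \cdots + t_{mk} \boldsymbol{c}_m = \sum_{l=1}^{m} t_{lk} \boldsymbol{c}_l, \quad k = 1, \ldots, m. \tag{2.107}$$

We define $\boldsymbol{S} = ((s_{ij})) \in \mathbb{R}^{n \times n}$ as the transformation matrix that maps
coordinates with respect to $\tilde{B}$ onto coordinates with respect to $B$ and
$\boldsymbol{T} = ((t_{lk})) \in \mathbb{R}^{m \times m}$ as the transformation matrix that maps coordinates
with respect to $\tilde{C}$ onto coordinates with respect to $C$. In particular, the $j$th
column of $\boldsymbol{S}$ is the coordinate representation of $\tilde{\boldsymbol{b}}_j$ with respect to $B$ and
the $k$th column of $\boldsymbol{T}$ is the coordinate representation of $\tilde{\boldsymbol{c}}_k$ with respect to
$C$. Note that both $\boldsymbol{S}$ and $\boldsymbol{T}$ are regular.

We are going to look at $\Phi(\tilde{\boldsymbol{b}}_j)$ from two perspectives. First, applying the
mapping $\Phi$, we get that for all $j = 1, \ldots, n$

$$\Phi(\tilde{\boldsymbol{b}}_j) = \sum_{k=1}^{m} \tilde{a}_{kj} \tilde{\boldsymbol{c}}_k \overset{(2.107)}{=} \sum_{k=1}^{m} \tilde{a}_{kj} \sum_{l=1}^{m} t_{lk} \boldsymbol{c}_l = \sum_{l=1}^{m} \left( \sum_{k=1}^{m} t_{lk} \tilde{a}_{kj} \right) \boldsymbol{c}_l, \tag{2.108}$$

where we first expressed the new basis vectors $\tilde{\boldsymbol{c}}_k \in W$ as linear combinations
of the basis vectors $\boldsymbol{c}_l \in W$ and then swapped the order of
summation.

Alternatively, when we express the $\tilde{\boldsymbol{b}}_j \in V$ as linear combinations of
$\boldsymbol{b}_i \in V$, we arrive at

$$\Phi(\tilde{\boldsymbol{b}}_j) \overset{(2.106)}{=} \Phi\left( \sum_{i=1}^{n} s_{ij} \boldsymbol{b}_i \right) = \sum_{i=1}^{n} s_{ij} \Phi(\boldsymbol{b}_i) = \sum_{i=1}^{n} s_{ij} \sum_{l=1}^{m} a_{li} \boldsymbol{c}_l \tag{2.109a}$$
$$= \sum_{l=1}^{m} \left( \sum_{i=1}^{n} a_{li} s_{ij} \right) \boldsymbol{c}_l, \quad j = 1, \ldots, n, \tag{2.109b}$$

where we exploited the linearity of $\Phi$. Comparing (2.108) and (2.109b),
it follows for all $j = 1, \ldots, n$ and $l = 1, \ldots, m$ that

$$\sum_{k=1}^{m} t_{lk} \tilde{a}_{kj} = \sum_{i=1}^{n} a_{li} s_{ij} \tag{2.110}$$

and, therefore,

$$\boldsymbol{T} \tilde{\boldsymbol{A}}_\Phi = \boldsymbol{A}_\Phi \boldsymbol{S} \in \mathbb{R}^{m \times n}, \tag{2.111}$$

such that

$$\tilde{\boldsymbol{A}}_\Phi = \boldsymbol{T}^{-1} \boldsymbol{A}_\Phi \boldsymbol{S}, \tag{2.112}$$

which proves Theorem 2.20.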


Theorem 2.20 tells us that with a basis change in $V$ ($B$ is replaced with
$\tilde{B}$) and $W$ ($C$ is replaced with $\tilde{C}$), the transformation matrix $A_\Phi$ of a
linear mapping $\Phi: V \to W$ is replaced by an equivalent matrix $\tilde{A}_\Phi$ with

$$\tilde{A}_\Phi = T^{-1} A_\Phi S.$$ (2.113)


Figure 2.11 illustrates this relation: Consider a homomorphism $\Phi: V \to W$
and ordered bases $B, \tilde{B}$ of $V$ and $C, \tilde{C}$ of $W$. The mapping $\Phi_{CB}$ is an
instantiation of $\Phi$ and maps basis vectors of $B$ onto linear combinations
of basis vectors of $C$. Assume that we know the transformation matrix $A_\Phi$
of $\Phi_{CB}$ with respect to the ordered bases $B, C$. When we perform a basis
change from $B$ to $\tilde{B}$ in $V$ and from $C$ to $\tilde{C}$ in $W$, we can determine the
corresponding transformation matrix $\tilde{A}_\Phi$ as follows: First, we find the
matrix representation of the linear mapping $\Psi_{B\tilde{B}}: V \to V$ that maps
coordinates with respect to the new basis $\tilde{B}$ onto the (unique) coordinates with
respect to the “old” basis $B$ (in $V$). Then, we use the transformation
matrix $A_\Phi$ of $\Phi_{CB}: V \to W$ to map these coordinates onto the coordinates
with respect to $C$ in $W$. Finally, we use a linear mapping $\Xi_{\tilde{C}C}: W \to W$
to map the coordinates with respect to $C$ onto coordinates with respect to
$\tilde{C}$. Therefore, we can express the linear mapping $\Phi_{\tilde{C}\tilde{B}}$ as a composition of
linear mappings that involve the “old” basis:

$$\Phi_{\tilde{C}\tilde{B}} = \Xi_{\tilde{C}C} \circ \Phi_{CB} \circ \Psi_{B\tilde{B}} = \Xi_{C\tilde{C}}^{-1} \circ \Phi_{CB} \circ \Psi_{B\tilde{B}}.$$ (2.114)

Concretely, we use $\Psi_{B\tilde{B}} = \mathrm{id}_V$ and $\Xi_{C\tilde{C}} = \mathrm{id}_W$, i.e., the identity mappings
that map vectors onto themselves, but with respect to a different basis.

[Figure 2.11: For a homomorphism $\Phi: V \to W$ and ordered bases $B, \tilde{B}$ of $V$ and $C, \tilde{C}$ of $W$ (marked in blue), we can express the mapping $\Phi_{\tilde{C}\tilde{B}}$ with respect to the bases $\tilde{B}, \tilde{C}$ equivalently as a composition of the homomorphisms $\Phi_{\tilde{C}\tilde{B}} = \Xi_{\tilde{C}C} \circ \Phi_{CB} \circ \Psi_{B\tilde{B}}$ with respect to the bases in the subscripts. The corresponding transformation matrices are in red.]


Definition 2.21 (Equivalence). Two matrices $A, \tilde{A} \in \mathbb{R}^{m \times n}$ are equivalent
if there exist regular matrices $S \in \mathbb{R}^{n \times n}$ and $T \in \mathbb{R}^{m \times m}$, such that
$\tilde{A} = T^{-1} A S$.


Definition 2.22 (Similarity). Two matrices $A, \tilde{A} \in \mathbb{R}^{n \times n}$ are similar if
there exists a regular matrix $S \in \mathbb{R}^{n \times n}$ with $\tilde{A} = S^{-1} A S$.


Remark. Similar matrices are always equivalent. However, equivalent
matrices are not necessarily similar. ♢


Remark. Consider vector spaces $V, W, X$. From the remark that follows
Theorem 2.17, we already know that for linear mappings $\Phi: V \to W$
and $\Psi: W \to X$ the mapping $\Psi \circ \Phi: V \to X$ is also linear. With
transformation matrices $A_\Phi$ and $A_\Psi$ of the corresponding mappings, the
overall transformation matrix is $A_{\Psi \circ \Phi} = A_\Psi A_\Phi$. ♢


In light of this remark, we can look at basis changes from the perspec- 
tive of composing linear mappings: 


= $A_\Phi$ is the transformation matrix of a linear mapping $\Phi_{CB}: V \to W$
with respect to the bases $B, C$.

= $\tilde{A}_\Phi$ is the transformation matrix of the linear mapping $\Phi_{\tilde{C}\tilde{B}}: V \to W$
with respect to the bases $\tilde{B}, \tilde{C}$.

= $S$ is the transformation matrix of a linear mapping $\Psi_{B\tilde{B}}: V \to V$
(automorphism) that represents $\tilde{B}$ in terms of $B$. Normally, $\Psi = \mathrm{id}_V$ is
the identity mapping in $V$.


Draft (2021-01-14) of “Mathematics for Machine Learning”. Feedback: https://mml-book.com.




= $T$ is the transformation matrix of a linear mapping $\Xi_{C\tilde{C}}: W \to W$
(automorphism) that represents $\tilde{C}$ in terms of $C$. Normally, $\Xi = \mathrm{id}_W$ is
the identity mapping in $W$.


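If we (informally) write down the transformations just in terms of bases,
then $A_\Phi: B \to C$, $\tilde{A}_\Phi: \tilde{B} \to \tilde{C}$, $S: \tilde{B} \to B$, $T: \tilde{C} \to C$ and
$T^{-1}: C \to \tilde{C}$, and

$$\tilde{B} \to \tilde{C} = \tilde{B} \to B \to C \to \tilde{C}$$ (2.115)

$$\tilde{A}_\Phi = T^{-1} A_\Phi S.$$ (2.116)

Note that the execution order in (2.116) is from right to left because
vectors are multiplied at the right-hand side so that
$x \mapsto S x \mapsto A_\Phi (S x) \mapsto T^{-1} (A_\Phi (S x)) = \tilde{A}_\Phi x$.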


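Example 2.24 (Basis Change)
Consider a linear mapping $\Phi: \mathbb{R}^3 \to \mathbb{R}^4$ whose transformation matrix is

$$A_\Phi = \begin{bmatrix} 1 & 2 & 0 \\ -1 & 1 & 3 \\ 3 & 7 & 1 \\ -1 & 2 & 4 \end{bmatrix}$$ (2.117)

with respect to the standard bases

$$B = \left( \begin{bmatrix} 1 \\ 0 \\ 0 \end{bmatrix}, \begin{bmatrix} 0 \\ 1 \\ 0 \end{bmatrix}, \begin{bmatrix} 0 \\ 0 \\ 1 \end{bmatrix} \right), \quad C = \left( \begin{bmatrix} 1 \\ 0 \\ 0 \\ 0 \end{bmatrix}, \begin{bmatrix} 0 \\ 1 \\ 0 \\ 0 \end{bmatrix}, \begin{bmatrix} 0 \\ 0 \\ 1 \\ 0 \end{bmatrix}, \begin{bmatrix} 0 \\ 0 \\ 0 \\ 1 \end{bmatrix} \right).$$ (2.118)

We seek the transformation matrix $\tilde{A}_\Phi$ of $\Phi$ with respect to the new bases

$$\tilde{B} = \left( \begin{bmatrix} 1 \\ 1 \\ 0 \end{bmatrix}, \begin{bmatrix} 0 \\ 1 \\ 1 \end{bmatrix}, \begin{bmatrix} 1 \\ 0 \\ 1 \end{bmatrix} \right) \subseteq \mathbb{R}^3, \quad \tilde{C} = \left( \begin{bmatrix} 1 \\ 1 \\ 0 \\ 0 \end{bmatrix}, \begin{bmatrix} 1 \\ 0 \\ 1 \\ 0 \end{bmatrix}, \begin{bmatrix} 0 \\ 1 \\ 1 \\ 0 \end{bmatrix}, \begin{bmatrix} 1 \\ 0 \\ 0 \\ 1 \end{bmatrix} \right).$$ (2.119)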
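Then,

$$S = \begin{bmatrix} 1 & 0 & 1 \\ 1 & 1 & 0 \\ 0 & 1 & 1 \end{bmatrix}, \quad T = \begin{bmatrix} 1 & 1 & 0 & 1 \\ 1 & 0 & 1 & 0 \\ 0 & 1 & 1 & 0 \\ 0 & 0 & 0 & 1 \end{bmatrix},$$ (2.120)

where the $i$th column of $S$ is the coordinate representation of $\tilde{b}_i$ in
terms of the basis vectors of $B$. Since $B$ is the standard basis, the
coordinate representation is straightforward to find. For a general basis $B$,
we would need to solve a linear equation system to find the $\lambda_i$ such that
$\sum_{i=1}^{3} \lambda_i b_i = \tilde{b}_j$, $j = 1, \ldots, 3$. Similarly, the $j$th column of $T$ is the
coordinate representation of $\tilde{c}_j$ in terms of the basis vectors of $C$.
Therefore, we obtain

$$\tilde{A}_\Phi = T^{-1} A_\Phi S = \frac{1}{2} \begin{bmatrix} 1 & 1 & -1 & -1 \\ 1 & -1 & 1 & -1 \\ -1 & 1 & 1 & 1 \\ 0 & 0 & 0 & 2 \end{bmatrix} \begin{bmatrix} 3 & 2 & 1 \\ 0 & 4 & 2 \\ 10 & 8 & 4 \\ 1 & 6 & 3 \end{bmatrix}$$ (2.121a)

$$= \begin{bmatrix} -4 & -4 & -2 \\ 6 & 0 & 0 \\ 4 & 8 & 4 \\ 1 & 6 & 3 \end{bmatrix}.$$ (2.121b)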

In Chapter 4, we will be able to exploit the concept of a basis change 
to find a basis with respect to which the transformation matrix of an en- 
domorphism has a particularly simple (diagonal) form. In Chapter 10, we 
will look at a data compression problem and find a convenient basis onto 
which we can project the data while minimizing the compression loss. 


2.7.3 Image and Kernel 


The image and kernel of a linear mapping are vector subspaces with cer- 
tain important properties. In the following, we will characterize them 
more carefully. 


Definition 2.23 (Image and Kernel).
For $\Phi: V \to W$, we define the kernel/null space

$$\ker(\Phi) := \Phi^{-1}(0_W) = \{ v \in V : \Phi(v) = 0_W \}$$ (2.122)

and the image/range

$$\mathrm{Im}(\Phi) := \Phi(V) = \{ w \in W \mid \exists v \in V : \Phi(v) = w \}.$$ (2.123)

We also call $V$ and $W$ the domain and codomain of $\Phi$, respectively.


Intuitively, the kernel is the set of vectors $v \in V$ that $\Phi$ maps onto the
neutral element $0_W \in W$. The image is the set of vectors $w \in W$ that
can be “reached” by $\Phi$ from any vector in $V$. An illustration is given in
Figure 2.12.


Remark. Consider a linear mapping $\Phi: V \to W$, where $V, W$ are vector
spaces.

= It always holds that $\Phi(0_V) = 0_W$ and, therefore, $0_V \in \ker(\Phi)$. In
particular, the null space is never empty.
= $\mathrm{Im}(\Phi) \subseteq W$ is a subspace of $W$, and $\ker(\Phi) \subseteq V$ is a subspace of $V$.
= $\Phi$ is injective (one-to-one) if and only if $\ker(\Phi) = \{0\}$.

♢

[Figure 2.12: Kernel and image of a linear mapping $\Phi: V \to W$.]


Remark (Null Space and Column Space). Let us consider $A \in \mathbb{R}^{m \times n}$ and
a linear mapping $\Phi: \mathbb{R}^n \to \mathbb{R}^m$, $x \mapsto A x$.

= For $A = [a_1, \ldots, a_n]$, where $a_i$ are the columns of $A$, we obtain

$$\mathrm{Im}(\Phi) = \{ A x : x \in \mathbb{R}^n \} = \left\{ \sum_{i=1}^{n} x_i a_i : x_1, \ldots, x_n \in \mathbb{R} \right\}$$ (2.124a)

$$= \mathrm{span}[a_1, \ldots, a_n] \subseteq \mathbb{R}^m,$$ (2.124b)

i.e., the image is the span of the columns of $A$, also called the column
space. Therefore, the column space (image) is a subspace of $\mathbb{R}^m$, where
$m$ is the “height” of the matrix.

= $\mathrm{rk}(A) = \dim(\mathrm{Im}(\Phi))$.

= The kernel/null space $\ker(\Phi)$ is the general solution to the homogeneous
system of linear equations $A x = 0$ and captures all possible
linear combinations of the elements in $\mathbb{R}^n$ that produce $0 \in \mathbb{R}^m$.

= The kernel is a subspace of $\mathbb{R}^n$, where $n$ is the “width” of the matrix.

= The kernel focuses on the relationship among the columns, and we can
use it to determine whether/how we can express a column as a linear
combination of other columns.

♢
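Example 2.25 (Image and Kernel of a Linear Mapping)
The mapping

$$\Phi: \mathbb{R}^4 \to \mathbb{R}^2, \quad \begin{bmatrix} x_1 \\ x_2 \\ x_3 \\ x_4 \end{bmatrix} \mapsto \begin{bmatrix} 1 & 2 & -1 & 0 \\ 1 & 0 & 0 & 1 \end{bmatrix} \begin{bmatrix} x_1 \\ x_2 \\ x_3 \\ x_4 \end{bmatrix} = \begin{bmatrix} x_1 + 2 x_2 - x_3 \\ x_1 + x_4 \end{bmatrix}$$ (2.125a)

$$= x_1 \begin{bmatrix} 1 \\ 1 \end{bmatrix} + x_2 \begin{bmatrix} 2 \\ 0 \end{bmatrix} + x_3 \begin{bmatrix} -1 \\ 0 \end{bmatrix} + x_4 \begin{bmatrix} 0 \\ 1 \end{bmatrix}$$ (2.125b)

is linear. To determine $\mathrm{Im}(\Phi)$, we can take the span of the columns of the
transformation matrix and obtain

$$\mathrm{Im}(\Phi) = \mathrm{span}\left[ \begin{bmatrix} 1 \\ 1 \end{bmatrix}, \begin{bmatrix} 2 \\ 0 \end{bmatrix}, \begin{bmatrix} -1 \\ 0 \end{bmatrix}, \begin{bmatrix} 0 \\ 1 \end{bmatrix} \right].$$ (2.126)

To compute the kernel (null space) of $\Phi$, we need to solve $A x = 0$, i.e.,
we need to solve a homogeneous equation system. To do this, we use
Gaussian elimination to transform $A$ into reduced row-echelon form:

$$\begin{bmatrix} 1 & 2 & -1 & 0 \\ 1 & 0 & 0 & 1 \end{bmatrix} \rightsquigarrow \cdots \rightsquigarrow \begin{bmatrix} 1 & 0 & 0 & 1 \\ 0 & 1 & -\tfrac{1}{2} & -\tfrac{1}{2} \end{bmatrix}.$$ (2.127)

This matrix is in reduced row-echelon form, and we can use the Minus-1
Trick to compute a basis of the kernel (see Section 2.3.3). Alternatively,
we can express the non-pivot columns (columns 3 and 4) as linear
combinations of the pivot columns (columns 1 and 2). The third column $a_3$ is
equivalent to $-\tfrac{1}{2}$ times the second column $a_2$. Therefore, $0 = a_3 + \tfrac{1}{2} a_2$. In
the same way, we see that $a_4 = a_1 - \tfrac{1}{2} a_2$ and, therefore, $0 = a_1 - \tfrac{1}{2} a_2 - a_4$.
Overall, this gives us the kernel (null space) as

$$\ker(\Phi) = \mathrm{span}\left[ \begin{bmatrix} 0 \\ \tfrac{1}{2} \\ 1 \\ 0 \end{bmatrix}, \begin{bmatrix} -1 \\ \tfrac{1}{2} \\ 0 \\ 1 \end{bmatrix} \right].$$ (2.128)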
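
The elimination steps of Example 2.25 can be reproduced with exact arithmetic. The sketch below is one possible implementation using only Python's standard library; the function names `rref` and `kernel_basis` are illustrative assumptions, and the kernel basis is read off the reduced row-echelon form by setting one free variable to 1 at a time, which for this matrix happens to yield the same vectors as (2.128).

```python
from fractions import Fraction


def rref(rows):
    """Return (R, pivots): reduced row-echelon form and pivot column indices."""
    M = [[Fraction(v) for v in row] for row in rows]
    n_rows, n_cols = len(M), len(M[0])
    pivots = []
    r = 0  # index of the row that receives the next pivot
    for c in range(n_cols):
        # Look for a row at or below r with a nonzero entry in column c.
        p = next((i for i in range(r, n_rows) if M[i][c] != 0), None)
        if p is None:
            continue
        M[r], M[p] = M[p], M[r]
        pivot = M[r][c]
        M[r] = [v / pivot for v in M[r]]
        # Eliminate column c from every other row.
        for i in range(n_rows):
            if i != r and M[i][c] != 0:
                factor = M[i][c]
                M[i] = [a - factor * b for a, b in zip(M[i], M[r])]
        pivots.append(c)
        r += 1
        if r == n_rows:
            break
    return M, pivots


def kernel_basis(rows):
    """Basis of {x : A x = 0}, one vector per non-pivot (free) column."""
    R, pivots = rref(rows)
    n_cols = len(R[0])
    basis = []
    for free in (c for c in range(n_cols) if c not in pivots):
        x = [Fraction(0)] * n_cols
        x[free] = Fraction(1)
        for row_idx, p in enumerate(pivots):
            x[p] = -R[row_idx][free]
        basis.append(x)
    return basis


A = [[1, 2, -1, 0],
     [1, 0, 0, 1]]
R, pivots = rref(A)
K = kernel_basis(A)
print(R)
print(K)
```

The number of pivot columns gives the dimension of the image, and the number of free columns gives the dimension of the kernel.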


Theorem 2.24 (Rank-Nullity Theorem). For vector spaces $V, W$ and a
linear mapping $\Phi: V \to W$ it holds that

$$\dim(\ker(\Phi)) + \dim(\mathrm{Im}(\Phi)) = \dim(V).$$ (2.129)

The rank-nullity theorem is also referred to as the fundamental theorem
of linear mappings (Axler, 2015, theorem 3.22). The following are direct
consequences of Theorem 2.24:

= If $\dim(\mathrm{Im}(\Phi)) < \dim(V)$, then $\ker(\Phi)$ is non-trivial, i.e., the kernel
contains more than $0_V$ and $\dim(\ker(\Phi)) \geq 1$.

= If $A_\Phi$ is the transformation matrix of $\Phi$ with respect to an ordered basis
and $\dim(\mathrm{Im}(\Phi)) < \dim(V)$, then the system of linear equations $A_\Phi x = 0$
has infinitely many solutions.

= If $\dim(V) = \dim(W)$, then the following three-way equivalence holds:

  - $\Phi$ is injective
  - $\Phi$ is surjective
  - $\Phi$ is bijective

since $\mathrm{Im}(\Phi) \subseteq W$.
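
As a quick sanity check, Theorem 2.24 can be applied to the mapping $\Phi: \mathbb{R}^4 \to \mathbb{R}^2$ of Example 2.25. The following worked equation only restates quantities computed there (two pivot columns, two kernel basis vectors) and is meant as an illustration, not as part of the proof.

```latex
\underbrace{\dim(\ker(\Phi))}_{=\,2 \text{ (two basis vectors in (2.128))}}
+ \underbrace{\dim(\mathrm{Im}(\Phi))}_{=\,\mathrm{rk}(A)\,=\,2 \text{ (two pivot columns)}}
= 4 = \dim(\mathbb{R}^4).
```

Since $\dim(\mathrm{Im}(\Phi)) = 2 < 4 = \dim(V)$, the first consequence above applies: the kernel is non-trivial, so $\Phi$ is not injective.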
2.8 Affine Spaces


In the following, we will have a closer look at spaces that are offset from 
the origin, i.e., spaces that are no longer vector subspaces. Moreover, we 
will briefly discuss properties of mappings between these affine spaces, 
which resemble linear mappings. 

Remark. In the machine learning literature, the distinction between linear
and affine is sometimes blurred, so we can find affine
spaces/mappings referred to as linear spaces/mappings. ♢


2.8.1 Affine Subspaces 


Definition 2.25 (Affine Subspace). Let $V$ be a vector space, $x_0 \in V$ and
$U \subseteq V$ a subspace. Then the subset

$$L = x_0 + U := \{ x_0 + u : u \in U \}$$ (2.130a)

$$= \{ v \in V \mid \exists u \in U : v = x_0 + u \} \subseteq V$$ (2.130b)

is called affine subspace or linear manifold of $V$. $U$ is called direction or
direction space, and $x_0$ is called support point. In Chapter 12, we refer to
such a subspace as a hyperplane.


Note that the definition of an affine subspace excludes $0$ if $x_0 \notin U$.
Therefore, an affine subspace is not a (linear) subspace (vector subspace)
of $V$ for $x_0 \notin U$.

Examples of affine subspaces are points, lines, and planes in $\mathbb{R}^3$, which
do not (necessarily) go through the origin.


Remark. Consider two affine subspaces $L = x_0 + U$ and $\tilde{L} = \tilde{x}_0 + \tilde{U}$ of a
vector space $V$. Then, $L \subseteq \tilde{L}$ if and only if $U \subseteq \tilde{U}$ and $x_0 - \tilde{x}_0 \in \tilde{U}$.

Affine subspaces are often described by parameters: Consider a $k$-dimensional
affine space $L = x_0 + U$ of $V$. If $(b_1, \ldots, b_k)$ is an ordered basis of
$U$, then every element $x \in L$ can be uniquely described as

$$x = x_0 + \lambda_1 b_1 + \ldots + \lambda_k b_k,$$ (2.131)

where $\lambda_1, \ldots, \lambda_k \in \mathbb{R}$. This representation is called parametric equation
of $L$ with directional vectors $b_1, \ldots, b_k$ and parameters $\lambda_1, \ldots, \lambda_k$. ♢


Example 2.26 (Affine Subspaces)

= One-dimensional affine subspaces are called lines and can be written
as $y = x_0 + \lambda b_1$, where $\lambda \in \mathbb{R}$ and $U = \mathrm{span}[b_1] \subseteq \mathbb{R}^n$ is a
one-dimensional subspace of $\mathbb{R}^n$. This means that a line is defined by a
support point $x_0$ and a vector $b_1$ that defines the direction. See Figure 2.13
for an illustration.

= Two-dimensional affine subspaces of $\mathbb{R}^n$ are called planes. The
parametric equation for planes is $y = x_0 + \lambda_1 b_1 + \lambda_2 b_2$, where $\lambda_1, \lambda_2 \in \mathbb{R}$
and $U = \mathrm{span}[b_1, b_2] \subseteq \mathbb{R}^n$. This means that a plane is defined by a
support point $x_0$ and two linearly independent vectors $b_1, b_2$ that span
the direction space.

= In $\mathbb{R}^n$, the $(n-1)$-dimensional affine subspaces are called hyperplanes,
and the corresponding parametric equation is $y = x_0 + \sum_{i=1}^{n-1} \lambda_i b_i$,
where $b_1, \ldots, b_{n-1}$ form a basis of an $(n-1)$-dimensional subspace
$U$ of $\mathbb{R}^n$. This means that a hyperplane is defined by a support point
$x_0$ and $(n-1)$ linearly independent vectors $b_1, \ldots, b_{n-1}$ that span the
direction space. In $\mathbb{R}^2$, a line is also a hyperplane. In $\mathbb{R}^3$, a plane is also
a hyperplane.





Remark (Inhomogeneous systems of linear equations and affine subspaces).
For $A \in \mathbb{R}^{m \times n}$ and $x \in \mathbb{R}^m$, the solution of the system of linear
equations $A \lambda = x$ is either the empty set or an affine subspace of $\mathbb{R}^n$ of
dimension $n - \mathrm{rk}(A)$. In particular, the solution of the linear equation
$\lambda_1 b_1 + \ldots + \lambda_n b_n = x$, where $(\lambda_1, \ldots, \lambda_n) \neq (0, \ldots, 0)$, is a hyperplane
in $\mathbb{R}^n$.

In $\mathbb{R}^n$, every $k$-dimensional affine subspace is the solution of an
inhomogeneous system of linear equations $A x = b$, where $A \in \mathbb{R}^{m \times n}$, $b \in \mathbb{R}^m$
and $\mathrm{rk}(A) = n - k$. Recall that for homogeneous equation systems
$A x = 0$ the solution was a vector subspace, which we can also think of
as a special affine space with support point $x_0 = 0$. ♢
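
The remark above can be illustrated numerically: a solvable system $A x = b$ has the solution set $x_0 + U$, where $x_0$ is any particular solution and $U$ is the null space of $A$. The following is a minimal sketch, assuming NumPy; the helper name `affine_solution_set`, the tolerances and the example system are our own choices for illustration.

```python
import numpy as np


def affine_solution_set(A, b, tol=1e-10):
    """Return (x0, U) with {x : A x = b} = x0 + span(columns of U),
    or None if the system has no solution (hypothetical helper)."""
    # Least-squares solution; it is an exact solution iff the system is consistent.
    x0, *_ = np.linalg.lstsq(A, b, rcond=None)
    if not np.allclose(A @ x0, b, atol=1e-8):
        return None
    # Right singular vectors beyond the rank span the null space of A.
    _, s, vt = np.linalg.svd(A)
    rank = int(np.sum(s > tol))
    return x0, vt[rank:].T


A = np.array([[1.0, 2.0, -1.0, 0.0],
              [1.0, 0.0, 0.0, 1.0]])
b = np.array([1.0, 2.0])

x0, U = affine_solution_set(A, b)
print("support point:", x0)
print("dimension of direction space:", U.shape[1])  # n - rk(A)

# Any choice of parameters gives another solution of A x = b.
lam = np.array([3.0, -2.0])
print(np.allclose(A @ (x0 + U @ lam), b))
```

For an inconsistent system, such as $x_1 + x_2 = 0$ together with $x_1 + x_2 = 1$, the function returns `None`, which corresponds to the empty solution set.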


2.8.2 Affine Mappings 


Similar to linear mappings between vector spaces, which we discussed 
in Section 2.7, we can define affine mappings between two affine spaces. 
Linear and affine mappings are closely related. Therefore, many properties 
that we already know from linear mappings, e.g., that the composition of 
linear mappings is a linear mapping, also hold for affine mappings. 


Definition 2.26 (Affine Mapping). For two vector spaces $V, W$, a linear
mapping $\Phi: V \to W$, and $a \in W$, the mapping

$$\phi: V \to W$$ (2.132)

$$x \mapsto a + \Phi(x)$$ (2.133)

is an affine mapping from $V$ to $W$. The vector $a$ is called the translation
vector of $\phi$.


= Every affine mapping $\phi: V \to W$ is also the composition of a linear
mapping $\Phi: V \to W$ and a translation $\tau: W \to W$ in $W$, such that
$\phi = \tau \circ \Phi$. The mappings $\Phi$ and $\tau$ are uniquely determined.

= The composition $\phi' \circ \phi$ of affine mappings $\phi: V \to W$, $\phi': W \to X$ is
affine.

= Affine mappings keep the geometric structure invariant. They also
preserve the dimension and parallelism.
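
The second property can be made concrete: composing $x \mapsto a_1 + M_1 x$ with $y \mapsto a_2 + M_2 y$ gives $x \mapsto (a_2 + M_2 a_1) + (M_2 M_1) x$, which is again affine. The sketch below assumes NumPy and represents each affine mapping by a matrix and a translation vector; the names `affine` and `compose` and the example values are illustrative only.

```python
import numpy as np


def affine(M, a):
    """Return the affine mapping x -> a + M x as a Python function."""
    return lambda x: a + M @ x


def compose(M2, a2, M1, a1):
    """Linear part and translation vector of x -> a2 + M2 (a1 + M1 x)."""
    return M2 @ M1, a2 + M2 @ a1


# phi: R^3 -> R^2 and phi_prime: R^2 -> R^2 (arbitrary example values).
M1, a1 = np.array([[1.0, 0.0, 2.0], [0.0, 1.0, -1.0]]), np.array([1.0, -1.0])
M2, a2 = np.array([[0.0, 1.0], [1.0, 1.0]]), np.array([2.0, 0.0])

phi, phi_prime = affine(M1, a1), affine(M2, a2)
M, a = compose(M2, a2, M1, a1)

x = np.array([1.0, 2.0, 3.0])
print(phi_prime(phi(x)))  # applying the two mappings one after the other
print(a + M @ x)          # applying the composed affine mapping directly
```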


2.9 Further Reading 


There are many resources for learning linear algebra, including the text- 
books by Strang (2003), Golan (2007), Axler (2015), and Liesen and 
Mehrmann (2015). There are also several online resources that we men- 
tioned in the introduction to this chapter. We only covered Gaussian elim- 
ination here, but there are many other approaches for solving systems of 
linear equations, and we refer to numerical linear algebra textbooks by 
Stoer and Bulirsch (2002), Golub and Van Loan (2012), and Horn and
Johnson (2013) for an in-depth discussion. 

In this book, we distinguish between the topics of linear algebra (e.g., 
vectors, matrices, linear independence, basis) and topics related to the 
geometry of a vector space. In Chapter 3, we will introduce the inner 
product, which induces a norm. These concepts allow us to define angles, 
lengths and distances, which we will use for orthogonal projections. Pro- 
jections turn out to be key in many machine learning algorithms, such as 
linear regression and principal component analysis, both of which we will 
cover in Chapters 9 and 10, respectively. 
Exercises

2.1 We consider $(\mathbb{R} \setminus \{-1\}, \star)$, where

$$a \star b := ab + a + b, \quad a, b \in \mathbb{R} \setminus \{-1\}.$$ (2.134)

a. Show that $(\mathbb{R} \setminus \{-1\}, \star)$ is an Abelian group.
b. Solve

$$3 \star x \star x = 15$$

in the Abelian group $(\mathbb{R} \setminus \{-1\}, \star)$, where $\star$ is defined in (2.134).
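2.2 Let $n$ be in $\mathbb{N} \setminus \{0\}$. Let $k, x$ be in $\mathbb{Z}$. We define the congruence class $\bar{k}$ of the
integer $k$ as the set

$$\bar{k} = \{ x \in \mathbb{Z} \mid x - k = 0 \ (\mathrm{mod}\ n) \} = \{ x \in \mathbb{Z} \mid \exists a \in \mathbb{Z} : (x - k = n \cdot a) \}.$$

We now define $\mathbb{Z}/n\mathbb{Z}$ (sometimes written $\mathbb{Z}_n$) as the set of all congruence
classes modulo $n$. Euclidean division implies that this set is a finite set
containing $n$ elements:

$$\mathbb{Z}_n = \{ \bar{0}, \bar{1}, \ldots, \overline{n-1} \}.$$

For all $\bar{a}, \bar{b} \in \mathbb{Z}_n$, we define

$$\bar{a} \oplus \bar{b} := \overline{a + b}.$$

a. Show that $(\mathbb{Z}_n, \oplus)$ is a group. Is it Abelian?
b. We now define another operation $\otimes$ for all $\bar{a}$ and $\bar{b}$ in $\mathbb{Z}_n$ as

$$\bar{a} \otimes \bar{b} = \overline{a \times b},$$ (2.135)

where $a \times b$ represents the usual multiplication in $\mathbb{Z}$.
Let $n = 5$. Draw the times table of the elements of $\mathbb{Z}_5 \setminus \{\bar{0}\}$ under $\otimes$, i.e.,
calculate the products $\bar{a} \otimes \bar{b}$ for all $\bar{a}$ and $\bar{b}$ in $\mathbb{Z}_5 \setminus \{\bar{0}\}$.
Hence, show that $\mathbb{Z}_5 \setminus \{\bar{0}\}$ is closed under $\otimes$ and possesses a neutral
element for $\otimes$. Display the inverse of all elements in $\mathbb{Z}_5 \setminus \{\bar{0}\}$ under $\otimes$.
Conclude that $(\mathbb{Z}_5 \setminus \{\bar{0}\}, \otimes)$ is an Abelian group.

c. Show that $(\mathbb{Z}_8 \setminus \{\bar{0}\}, \otimes)$ is not a group.

d. We recall that the Bézout theorem states that two integers $a$ and $b$ are
relatively prime (i.e., $\gcd(a, b) = 1$) if and only if there exist two integers
$u$ and $v$ such that $au + bv = 1$. Show that $(\mathbb{Z}_n \setminus \{\bar{0}\}, \otimes)$ is a group if and
only if $n \in \mathbb{N} \setminus \{0\}$ is prime.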


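2.3 Consider the set $G$ of $3 \times 3$ matrices defined as follows:

$$G = \left\{ \begin{bmatrix} 1 & x & z \\ 0 & 1 & y \\ 0 & 0 & 1 \end{bmatrix} \in \mathbb{R}^{3 \times 3} \ \middle|\ x, y, z \in \mathbb{R} \right\}$$

We define $\cdot$ as the standard matrix multiplication.
Is $(G, \cdot)$ a group? If yes, is it Abelian? Justify your answer.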
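2.4 Compute the following matrix products, if possible:

a. $\begin{bmatrix} 1 & 2 \\ 4 & 5 \\ 7 & 8 \end{bmatrix} \begin{bmatrix} 1 & 1 & 0 \\ 0 & 1 & 1 \\ 1 & 0 & 1 \end{bmatrix}$

b. $\begin{bmatrix} 1 & 2 & 3 \\ 4 & 5 & 6 \\ 7 & 8 & 9 \end{bmatrix} \begin{bmatrix} 1 & 1 & 0 \\ 0 & 1 & 1 \\ 1 & 0 & 1 \end{bmatrix}$

c. $\begin{bmatrix} 1 & 1 & 0 \\ 0 & 1 & 1 \\ 1 & 0 & 1 \end{bmatrix} \begin{bmatrix} 1 & 2 & 3 \\ 4 & 5 & 6 \\ 7 & 8 & 9 \end{bmatrix}$

d. $\begin{bmatrix} 1 & 2 & 1 & 2 \\ 4 & 1 & -1 & -4 \end{bmatrix} \begin{bmatrix} 0 & 3 \\ 1 & -1 \\ 2 & 1 \\ 5 & 2 \end{bmatrix}$

e. $\begin{bmatrix} 0 & 3 \\ 1 & -1 \\ 2 & 1 \\ 5 & 2 \end{bmatrix} \begin{bmatrix} 1 & 2 & 1 & 2 \\ 4 & 1 & -1 & -4 \end{bmatrix}$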

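2.5 Find the set $S$ of all solutions in $x$ of the following inhomogeneous linear
systems $A x = b$, where $A$ and $b$ are defined as follows:

a. $A = \begin{bmatrix} 1 & 1 & -1 & -1 \\ 2 & 5 & -7 & -5 \\ 2 & -1 & 1 & 3 \\ 5 & 2 & -4 & 2 \end{bmatrix}, \quad b = \begin{bmatrix} 1 \\ -2 \\ 4 \\ 6 \end{bmatrix}$

b. $A = \begin{bmatrix} 1 & -1 & 0 & 0 & 1 \\ 1 & 1 & 0 & -3 & 0 \\ 2 & -1 & 0 & 1 & -1 \\ -1 & 2 & 0 & -2 & -1 \end{bmatrix}, \quad b = \begin{bmatrix} 3 \\ 6 \\ 5 \\ -1 \end{bmatrix}$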

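2.6 Using Gaussian elimination, find all solutions of the inhomogeneous
equation system $A x = b$ with

$$A = \begin{bmatrix} 0 & 1 & 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1 & 1 & 0 \\ 0 & 1 & 0 & 0 & 0 & 1 \end{bmatrix}, \quad b = \begin{bmatrix} 2 \\ -1 \\ 1 \end{bmatrix}.$$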
2.7 Find all solutions in $x = \begin{bmatrix} x_1 \\ x_2 \\ x_3 \end{bmatrix} \in \mathbb{R}^3$ of the equation system $A x = 12 x$,
where

$$A = \begin{bmatrix} 6 & 4 & 3 \\ 6 & 0 & 9 \\ 0 & 8 & 0 \end{bmatrix}$$

and $\sum_{i=1}^{3} x_i = 1$.
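2.8 Determine the inverses of the following matrices if possible:

a. $A = \begin{bmatrix} 2 & 3 & 4 \\ 3 & 4 & 5 \\ 4 & 5 & 6 \end{bmatrix}$

b. $A = \begin{bmatrix} 1 & 0 & 1 & 0 \\ 0 & 1 & 1 & 0 \\ 1 & 1 & 0 & 1 \\ 1 & 1 & 1 & 0 \end{bmatrix}$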


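2.9 Which of the following sets are subspaces of $\mathbb{R}^3$?
a. $A = \{ (\lambda, \lambda + \mu^3, \lambda - \mu^3) \mid \lambda, \mu \in \mathbb{R} \}$
b. $B = \{ (\lambda^2, -\lambda^2, 0) \mid \lambda \in \mathbb{R} \}$
c. Let $\gamma$ be in $\mathbb{R}$.
$C = \{ (\xi_1, \xi_2, \xi_3) \in \mathbb{R}^3 \mid \xi_1 - 2 \xi_2 + 3 \xi_3 = \gamma \}$
d. $D = \{ (\xi_1, \xi_2, \xi_3) \in \mathbb{R}^3 \mid \xi_2 \in \mathbb{Z} \}$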


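2.10 Are the following sets of vectors linearly independent?

a. $x_1 = \begin{bmatrix} 2 \\ -1 \\ 3 \end{bmatrix}, \quad x_2 = \begin{bmatrix} 1 \\ 1 \\ -2 \end{bmatrix}, \quad x_3 = \begin{bmatrix} 3 \\ -3 \\ 8 \end{bmatrix}$

b. $x_1 = \begin{bmatrix} 1 \\ 2 \\ 1 \\ 0 \\ 0 \end{bmatrix}, \quad x_2 = \begin{bmatrix} 1 \\ 1 \\ 0 \\ 1 \\ 1 \end{bmatrix}, \quad x_3 = \begin{bmatrix} 1 \\ 0 \\ 0 \\ 1 \\ 1 \end{bmatrix}$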
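2.11 Write

$$y = \begin{bmatrix} 1 \\ -2 \\ 5 \end{bmatrix}$$

as linear combination of

$$x_1 = \begin{bmatrix} 1 \\ 1 \\ 1 \end{bmatrix}, \quad x_2 = \begin{bmatrix} 1 \\ 2 \\ 3 \end{bmatrix}, \quad x_3 = \begin{bmatrix} 2 \\ -1 \\ 1 \end{bmatrix}.$$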
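2.12 Consider two subspaces of $\mathbb{R}^4$:

$$U_1 = \mathrm{span}\left[ \begin{bmatrix} 1 \\ 1 \\ -3 \\ 1 \end{bmatrix}, \begin{bmatrix} 2 \\ -1 \\ 0 \\ -1 \end{bmatrix}, \begin{bmatrix} -1 \\ 1 \\ -1 \\ 1 \end{bmatrix} \right], \quad U_2 = \mathrm{span}\left[ \begin{bmatrix} -1 \\ -2 \\ 2 \\ 1 \end{bmatrix}, \begin{bmatrix} 2 \\ -2 \\ 0 \\ 0 \end{bmatrix}, \begin{bmatrix} -3 \\ 6 \\ -2 \\ -1 \end{bmatrix} \right].$$

Determine a basis of $U_1 \cap U_2$.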

2.13 Consider two subspaces $U_1$ and $U_2$, where $U_1$ is the solution space of the
homogeneous equation system $A_1 x = 0$ and $U_2$ is the solution space of the
homogeneous equation system $A_2 x = 0$ with

$$A_1 = \begin{bmatrix} 1 & 0 & 1 \\ 1 & -2 & -1 \\ 2 & 1 & 3 \\ 1 & 0 & 1 \end{bmatrix}, \quad A_2 = \begin{bmatrix} 3 & -3 & 0 \\ 1 & 2 & 3 \\ 7 & -5 & 2 \\ 3 & -1 & 2 \end{bmatrix}.$$

a. Determine the dimension of $U_1$, $U_2$.
b. Determine bases of $U_1$ and $U_2$.
c. Determine a basis of $U_1 \cap U_2$.


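2.14 Consider two subspaces $U_1$ and $U_2$, where $U_1$ is spanned by the columns of
$A_1$ and $U_2$ is spanned by the columns of $A_2$ with

$$A_1 = \begin{bmatrix} 1 & 0 & 1 \\ 1 & -2 & -1 \\ 2 & 1 & 3 \\ 1 & 0 & 1 \end{bmatrix}, \quad A_2 = \begin{bmatrix} 3 & -3 & 0 \\ 1 & 2 & 3 \\ 7 & -5 & 2 \\ 3 & -1 & 2 \end{bmatrix}.$$

a. Determine the dimension of $U_1$, $U_2$.
b. Determine bases of $U_1$ and $U_2$.
c. Determine a basis of $U_1 \cap U_2$.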
2.15 Let $F = \{ (x, y, z) \in \mathbb{R}^3 \mid x + y - z = 0 \}$ and $G = \{ (a - b, a + b, a - 3b) \mid a, b \in \mathbb{R} \}$.
a. Show that $F$ and $G$ are subspaces of $\mathbb{R}^3$.
b. Calculate $F \cap G$ without resorting to any basis vector.
c. Find one basis for $F$ and one for $G$, calculate $F \cap G$ using the basis vectors
previously found and check your result with the previous question.


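2.16 Are the following mappings linear?
a. Let $a, b \in \mathbb{R}$.

$$\Phi: L^1([a, b]) \to \mathbb{R}, \quad f \mapsto \Phi(f) = \int_a^b f(x) \, dx,$$

where $L^1([a, b])$ denotes the set of integrable functions on $[a, b]$.

b.

$$\Phi: C^1 \to C^0, \quad f \mapsto \Phi(f) = f',$$

where for $k \geq 1$, $C^k$ denotes the set of $k$ times continuously
differentiable functions, and $C^0$ denotes the set of continuous functions.

c.

$$\Phi: \mathbb{R} \to \mathbb{R}, \quad x \mapsto \Phi(x) = \cos(x)$$

d.

$$\Phi: \mathbb{R}^3 \to \mathbb{R}^2, \quad x \mapsto \begin{bmatrix} 1 & 2 & 3 \\ 1 & 4 & 3 \end{bmatrix} x$$

e. Let $\theta$ be in $[0, 2\pi[$ and

$$\Phi: \mathbb{R}^2 \to \mathbb{R}^2, \quad x \mapsto \begin{bmatrix} \cos(\theta) & \sin(\theta) \\ -\sin(\theta) & \cos(\theta) \end{bmatrix} x$$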

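2.17 Consider the linear mapping

$$\Phi: \mathbb{R}^3 \to \mathbb{R}^4, \quad \Phi\left( \begin{bmatrix} x_1 \\ x_2 \\ x_3 \end{bmatrix} \right) = \begin{bmatrix} 3 x_1 + 2 x_2 + x_3 \\ x_1 + x_2 + x_3 \\ x_1 - 3 x_2 \\ 2 x_1 + 3 x_2 + x_3 \end{bmatrix}$$

= Find the transformation matrix $A_\Phi$.
= Determine $\mathrm{rk}(A_\Phi)$.
= Compute the kernel and image of $\Phi$. What are $\dim(\ker(\Phi))$ and $\dim(\mathrm{Im}(\Phi))$?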


2.18 Let $E$ be a vector space. Let $f$ and $g$ be two automorphisms on $E$ such that
$f \circ g = \mathrm{id}_E$ (i.e., $f \circ g$ is the identity mapping $\mathrm{id}_E$). Show that $\ker(f) =
\ker(g \circ f)$, $\mathrm{Im}(g) = \mathrm{Im}(g \circ f)$ and that $\ker(f) \cap \mathrm{Im}(g) = \{0_E\}$.

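2.19 Consider an endomorphism $\Phi: \mathbb{R}^3 \to \mathbb{R}^3$ whose transformation matrix
(with respect to the standard basis in $\mathbb{R}^3$) is

$$A_\Phi = \begin{bmatrix} 1 & 1 & 0 \\ 1 & -1 & 0 \\ 1 & 1 & 1 \end{bmatrix}.$$

a. Determine $\ker(\Phi)$ and $\mathrm{Im}(\Phi)$.
b. Determine the transformation matrix $\tilde{A}_\Phi$ with respect to the basis

$$B = \left( \begin{bmatrix} 1 \\ 1 \\ 1 \end{bmatrix}, \begin{bmatrix} 1 \\ 2 \\ 1 \end{bmatrix}, \begin{bmatrix} 1 \\ 0 \\ 0 \end{bmatrix} \right),$$

i.e., perform a basis change toward the new basis $B$.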
2.20 Let us consider $b_1, b_2, b_1', b_2'$, 4 vectors of $\mathbb{R}^2$ expressed in the standard basis
of $\mathbb{R}^2$ as

$$b_1 = \begin{bmatrix} 2 \\ 1 \end{bmatrix}, \quad b_2 = \begin{bmatrix} -1 \\ -1 \end{bmatrix}, \quad b_1' = \begin{bmatrix} 2 \\ -2 \end{bmatrix}, \quad b_2' = \begin{bmatrix} 1 \\ 1 \end{bmatrix}$$

and let us define two ordered bases $B = (b_1, b_2)$ and $B' = (b_1', b_2')$ of $\mathbb{R}^2$.

a. Show that $B$ and $B'$ are two bases of $\mathbb{R}^2$ and draw those basis vectors.

b. Compute the matrix $P_1$ that performs a basis change from $B'$ to $B$.

c. We consider $c_1, c_2, c_3$, three vectors of $\mathbb{R}^3$ defined in the standard basis
of $\mathbb{R}^3$ as

$$c_1 = \begin{bmatrix} 1 \\ 2 \\ -1 \end{bmatrix}, \quad c_2 = \begin{bmatrix} 0 \\ -1 \\ 2 \end{bmatrix}, \quad c_3 = \begin{bmatrix} 1 \\ 0 \\ -1 \end{bmatrix}$$

and we define $C = (c_1, c_2, c_3)$.

(i) Show that $C$ is a basis of $\mathbb{R}^3$, e.g., by using determinants (see
Section 4.1).

(ii) Let us call $C' = (c_1', c_2', c_3')$ the standard basis of $\mathbb{R}^3$. Determine
the matrix $P_2$ that performs the basis change from $C$ to $C'$.

d. We consider a homomorphism $\Phi: \mathbb{R}^2 \to \mathbb{R}^3$, such that

$$\Phi(b_1 + b_2) = c_2 + c_3$$
$$\Phi(b_1 - b_2) = 2 c_1 - c_2 + 3 c_3$$

where $B = (b_1, b_2)$ and $C = (c_1, c_2, c_3)$ are ordered bases of $\mathbb{R}^2$ and $\mathbb{R}^3$,
respectively.

Determine the transformation matrix $A_\Phi$ of $\Phi$ with respect to the
ordered bases $B$ and $C$.

e. Determine $A'$, the transformation matrix of $\Phi$ with respect to the bases
$B'$ and $C'$.

f. Let us consider the vector $x \in \mathbb{R}^2$ whose coordinates in $B'$ are $[2, 3]^\top$.
In other words, $x = 2 b_1' + 3 b_2'$.

(i) Calculate the coordinates of $x$ in $B$.
(ii) Based on that, compute the coordinates of $\Phi(x)$ expressed in $C$.
(iii) Then, write $\Phi(x)$ in terms of $c_1', c_2', c_3'$.
(iv) Use the representation of $x$ in $B'$ and the matrix $A'$ to find this
result directly.
Analytic Geometry


In Chapter 2, we studied vectors, vector spaces, and linear mappings at 
a general but abstract level. In this chapter, we will add some geomet- 
ric interpretation and intuition to all of these concepts. In particular, we 
will look at geometric vectors and compute their lengths and distances 
or angles between two vectors. To be able to do this, we equip the vec- 
tor space with an inner product that induces the geometry of the vector 
space. Inner products and their corresponding norms and metrics capture 
the intuitive notions of similarity and distances, which we use to develop 
the support vector machine in Chapter 12. We will then use the concepts 
of lengths and angles between vectors to discuss orthogonal projections, 
which will play a central role when we discuss principal component anal- 
ysis in Chapter 10 and regression via maximum likelihood estimation in 
Chapter 9. Figure 3.1 gives an overview of how concepts in this chapter 
are related and how they are connected to other chapters of the book. 
3.1 Norms


When we think of geometric vectors, i.e., directed line segments that start 
at the origin, then intuitively the length of a vector is the distance of the 
“end” of this directed line segment from the origin. In the following, we 
will discuss the notion of the length of vectors using the concept of a norm. 


Definition 3.1 (Norm). A norm on a vector space $V$ is a function

$$\| \cdot \| : V \to \mathbb{R},$$ (3.1)

$$x \mapsto \| x \|,$$ (3.2)

which assigns each vector $x$ its length $\| x \| \in \mathbb{R}$, such that for all $\lambda \in \mathbb{R}$
and $x, y \in V$ the following hold:

= Absolutely homogeneous: $\| \lambda x \| = |\lambda| \, \| x \|$
= Triangle inequality: $\| x + y \| \leq \| x \| + \| y \|$
= Positive definite: $\| x \| \geq 0$ and $\| x \| = 0 \iff x = 0$


In geometric terms, the triangle inequality states that for any triangle, 
the sum of the lengths of any two sides must be greater than or equal 
to the length of the remaining side; see Figure 3.2 for an illustration. 
Definition 3.1 is in terms of a general vector space V (Section 2.4), but 
in this book we will only consider a finite-dimensional vector space $\mathbb{R}^n$.
Recall that for a vector $x \in \mathbb{R}^n$ we denote the elements of the vector using
a subscript, that is, $x_i$ is the $i$th element of the vector $x$.


Example 3.1 (Manhattan Norm)
The Manhattan norm on $\mathbb{R}^n$ is defined for $x \in \mathbb{R}^n$ as

$$\| x \|_1 := \sum_{i=1}^{n} |x_i|,$$ (3.3)

where $| \cdot |$ is the absolute value. The left panel of Figure 3.3 shows all
vectors $x \in \mathbb{R}^2$ with $\| x \|_1 = 1$. The Manhattan norm is also called $\ell_1$
norm.
[Figure 3.3: For different norms, the vectors with norm 1. Left: Manhattan norm. Right: Euclidean norm.]
Example 3.2 (Euclidean Norm)
The Euclidean norm of $x \in \mathbb{R}^n$ is defined as

$$\| x \|_2 := \sqrt{\sum_{i=1}^{n} x_i^2} = \sqrt{x^\top x}$$ (3.4)

and computes the Euclidean distance of $x$ from the origin. The right panel
of Figure 3.3 shows all vectors $x \in \mathbb{R}^2$ with $\| x \|_2 = 1$. The Euclidean
norm is also called $\ell_2$ norm.


Remark. Throughout this book, we will use the Euclidean norm (3.4) by
default if not stated otherwise. ♢
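
The two norms of Examples 3.1 and 3.2 are easy to compute directly from their definitions. The following is a small sketch, assuming NumPy; the function names `manhattan_norm` and `euclidean_norm` are our own, and the results are compared against `np.linalg.norm` with `ord=1` and `ord=2`.

```python
import numpy as np


def manhattan_norm(x):
    """l1 norm: sum of the absolute values of the components."""
    return float(np.sum(np.abs(x)))


def euclidean_norm(x):
    """l2 norm: square root of the sum of squared components."""
    return float(np.sqrt(np.sum(np.square(x))))


x = np.array([3.0, -4.0])
y = np.array([1.0, 2.0])

print(manhattan_norm(x), np.linalg.norm(x, 1))
print(euclidean_norm(x), np.linalg.norm(x, 2))

# Spot check of the triangle inequality for both norms.
print(manhattan_norm(x + y) <= manhattan_norm(x) + manhattan_norm(y))
print(euclidean_norm(x + y) <= euclidean_norm(x) + euclidean_norm(y))
```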


3.2 Inner Products 


Inner products allow for the introduction of intuitive geometrical con- 
cepts, such as the length of a vector and the angle or distance between 
two vectors. A major purpose of inner products is to determine whether 
vectors are orthogonal to each other. 


3.2.1 Dot Product 


We may already be familiar with a particular type of inner product, the
scalar product/dot product in $\mathbb{R}^n$, which is given by

$$x^\top y = \sum_{i=1}^{n} x_i y_i.$$ (3.5)


We will refer to this particular inner product as the dot product in this 
book. However, inner products are more general concepts with specific 
properties, which we will now introduce. 


3.2.2 General Inner Products 


Recall the linear mapping from Section 2.7, where we can rearrange the
mapping with respect to addition and multiplication with a scalar. A
bilinear mapping $\Omega$ is a mapping with two arguments, and it is linear in
each argument, i.e., when we look at a vector space $V$ then it holds that
for all $x, y, z \in V$, $\lambda, \psi \in \mathbb{R}$ that

$$\Omega(\lambda x + \psi y, z) = \lambda \Omega(x, z) + \psi \Omega(y, z)$$ (3.6)

$$\Omega(x, \lambda y + \psi z) = \lambda \Omega(x, y) + \psi \Omega(x, z).$$ (3.7)

Here, (3.6) asserts that $\Omega$ is linear in the first argument, and (3.7) asserts
that $\Omega$ is linear in the second argument (see also (2.87)).




Definition 3.2. Let $V$ be a vector space and $\Omega: V \times V \to \mathbb{R}$ be a bilinear
mapping that takes two vectors and maps them onto a real number. Then

= $\Omega$ is called symmetric if $\Omega(x, y) = \Omega(y, x)$ for all $x, y \in V$, i.e., the
order of the arguments does not matter.
= $\Omega$ is called positive definite if

$$\forall x \in V \setminus \{0\} : \Omega(x, x) > 0, \quad \Omega(0, 0) = 0.$$ (3.8)


Definition 3.3. Let $V$ be a vector space and $\Omega: V \times V \to \mathbb{R}$ be a bilinear
mapping that takes two vectors and maps them onto a real number. Then

= A positive definite, symmetric bilinear mapping $\Omega: V \times V \to \mathbb{R}$ is called
an inner product on $V$. We typically write $\langle x, y \rangle$ instead of $\Omega(x, y)$.

= The pair $(V, \langle \cdot, \cdot \rangle)$ is called an inner product space or (real) vector space
with inner product. If we use the dot product defined in (3.5), we call
$(V, \langle \cdot, \cdot \rangle)$ a Euclidean vector space.


We will refer to these spaces as inner product spaces in this book. 


Example 3.3 (Inner Product That Is Not the Dot Product)
Consider $V = \mathbb{R}^2$. If we define

$$\langle x, y \rangle := x_1 y_1 - (x_1 y_2 + x_2 y_1) + 2 x_2 y_2$$ (3.9)

then $\langle \cdot, \cdot \rangle$ is an inner product but different from the dot product. The proof
will be an exercise.


3.2.3 Symmetric, Positive Definite Matrices 


Symmetric, positive definite matrices play an important role in machine 
learning, and they are defined via the inner product. In Section 4.3, we 
will return to symmetric, positive definite matrices in the context of matrix 
decompositions. The idea of symmetric positive semidefinite matrices is 
key in the definition of kernels (Section 12.4). 

Consider an $n$-dimensional vector space $V$ with an inner product
$\langle \cdot, \cdot \rangle : V \times V \to \mathbb{R}$ (see Definition 3.3) and an ordered basis $B = (b_1, \ldots, b_n)$ of
$V$. Recall from Section 2.6.1 that any vectors $x, y \in V$ can be written as
linear combinations of the basis vectors so that $x = \sum_{i=1}^{n} \psi_i b_i \in V$ and
$y = \sum_{j=1}^{n} \lambda_j b_j \in V$ for suitable $\psi_i, \lambda_j \in \mathbb{R}$. Due to the bilinearity of the
inner product, it holds for all $x, y \in V$ that

$$\langle x, y \rangle = \left\langle \sum_{i=1}^{n} \psi_i b_i, \sum_{j=1}^{n} \lambda_j b_j \right\rangle = \sum_{i=1}^{n} \sum_{j=1}^{n} \psi_i \langle b_i, b_j \rangle \lambda_j = \hat{x}^\top A \hat{y},$$ (3.10)

where $A_{ij} := \langle b_i, b_j \rangle$ and $\hat{x}, \hat{y}$ are the coordinates of $x$ and $y$ with respect
to the basis $B$. This implies that the inner product $\langle \cdot, \cdot \rangle$ is uniquely
determined through $A$. The symmetry of the inner product also means that $A$
is symmetric. Furthermore, the positive definiteness of the inner product
implies that

$$\forall x \in V \setminus \{0\} : x^\top A x > 0.$$ (3.11)


Definition 3.4 (Symmetric, Positive Definite Matrix). A symmetric matrix
$A \in \mathbb{R}^{n \times n}$ that satisfies (3.11) is called symmetric, positive definite, or
just positive definite. If only $\geq$ holds in (3.11), then $A$ is called symmetric,
positive semidefinite.


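Example 3.4 (Symmetric, Positive Definite Matrices)
Consider the matrices

$$A_1 = \begin{bmatrix} 9 & 6 \\ 6 & 5 \end{bmatrix}, \quad A_2 = \begin{bmatrix} 9 & 6 \\ 6 & 3 \end{bmatrix}.$$ (3.12)

$A_1$ is positive definite because it is symmetric and

$$x^\top A_1 x = \begin{bmatrix} x_1 & x_2 \end{bmatrix} \begin{bmatrix} 9 & 6 \\ 6 & 5 \end{bmatrix} \begin{bmatrix} x_1 \\ x_2 \end{bmatrix}$$ (3.13a)

$$= 9 x_1^2 + 12 x_1 x_2 + 5 x_2^2 = (3 x_1 + 2 x_2)^2 + x_2^2 > 0$$ (3.13b)

for all $x \in V \setminus \{0\}$. In contrast, $A_2$ is symmetric but not positive definite
because $x^\top A_2 x = 9 x_1^2 + 12 x_1 x_2 + 3 x_2^2 = (3 x_1 + 2 x_2)^2 - x_2^2$ can be less
than 0, e.g., for $x = [2, -3]^\top$.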
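
Positive definiteness of a symmetric matrix can also be tested numerically through its eigenvalues, which are all positive exactly when (3.11) holds. The following is a minimal sketch, assuming NumPy; the helper `is_symmetric_positive_definite` and its tolerance are our own choices, and the eigenvalue test is used here as a stand-in for the completing-the-square argument in the example. The matrix `A3` encodes the inner product (3.9) with respect to the standard basis.

```python
import numpy as np


def is_symmetric_positive_definite(A, tol=1e-12):
    """Check symmetry and that all eigenvalues exceed tol (hypothetical helper)."""
    A = np.asarray(A, dtype=float)
    if not np.allclose(A, A.T):
        return False
    # eigvalsh is intended for symmetric matrices and returns real eigenvalues.
    return bool(np.all(np.linalg.eigvalsh(A) > tol))


A1 = np.array([[9.0, 6.0], [6.0, 5.0]])
A2 = np.array([[9.0, 6.0], [6.0, 3.0]])
A3 = np.array([[1.0, -1.0], [-1.0, 2.0]])  # matrix of the inner product (3.9)

print(is_symmetric_positive_definite(A1))
print(is_symmetric_positive_definite(A2))
print(is_symmetric_positive_definite(A3))

# The counterexample from the text: a vector with negative quadratic form under A2.
x = np.array([2.0, -3.0])
print(x @ A2 @ x)
```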


If $A \in \mathbb{R}^{n \times n}$ is symmetric, positive definite, then

$$\langle x, y \rangle = \hat{x}^\top A \hat{y}$$ (3.14)

defines an inner product with respect to an ordered basis $B$, where $\hat{x}$ and
$\hat{y}$ are the coordinate representations of $x, y \in V$ with respect to $B$.


Theorem 3.5. For a real-valued, finite-dimensional vector space $V$ and an
ordered basis $B$ of $V$, it holds that $\langle \cdot, \cdot \rangle : V \times V \to \mathbb{R}$ is an inner product if
and only if there exists a symmetric, positive definite matrix $A \in \mathbb{R}^{n \times n}$ with

$$\langle x, y \rangle = \hat{x}^\top A \hat{y}.$$ (3.15)


The following properties hold if $A \in \mathbb{R}^{n \times n}$ is symmetric and positive
definite:

= The null space (kernel) of $A$ consists only of $0$ because $x^\top A x > 0$ for
all $x \neq 0$. This implies that $A x \neq 0$ if $x \neq 0$.

= The diagonal elements $a_{ii}$ of $A$ are positive because $a_{ii} = e_i^\top A e_i > 0$,
where $e_i$ is the $i$th vector of the standard basis in $\mathbb{R}^n$.
3.3 Lengths and Distances


In Section 3.1, we already discussed norms that we can use to compute
the length of a vector. Inner products and norms are closely related in the
sense that any inner product induces a norm

$$\| x \| := \sqrt{\langle x, x \rangle}$$ (3.16)


in a natural way, such that we can compute lengths of vectors using the in- 
ner product. However, not every norm is induced by an inner product. The 
Manhattan norm (3.3) is an example of a norm without a corresponding 
inner product. In the following, we will focus on norms that are induced 
by inner products and introduce geometric concepts, such as lengths, dis- 
tances, and angles. 


Remark (Cauchy-Schwarz Inequality). For an inner product vector space
$(V, \langle \cdot, \cdot \rangle)$ the induced norm $\| \cdot \|$ satisfies the Cauchy-Schwarz inequality

$$| \langle x, y \rangle | \leq \| x \| \, \| y \|.$$ (3.17)

♢


Example 3.5 (Lengths of Vectors Using Inner Products) 

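In geometry, we are often interested in lengths of vectors. We can now use
an inner product to compute them using (3.16). Let us take $\boldsymbol{x} = [1, 1]^\top \in \mathbb{R}^2$.
If we use the dot product as the inner product, with (3.16) we obtain

$$\|\boldsymbol{x}\| = \sqrt{\boldsymbol{x}^\top \boldsymbol{x}} = \sqrt{1^2 + 1^2} = \sqrt{2} \tag{3.18}$$

as the length of $\boldsymbol{x}$. Let us now choose a different inner product:

$$\langle \boldsymbol{x}, \boldsymbol{y} \rangle := \boldsymbol{x}^\top \begin{bmatrix} 1 & -\tfrac{1}{2} \\ -\tfrac{1}{2} & 1 \end{bmatrix} \boldsymbol{y} = x_1 y_1 - \tfrac{1}{2}(x_1 y_2 + x_2 y_1) + x_2 y_2. \tag{3.19}$$

If we compute the norm of a vector, then this inner product returns smaller
values than the dot product if $x_1$ and $x_2$ have the same sign (and $x_1 x_2 > 0$);
otherwise, it returns greater values than the dot product (or equal values
if $x_1 x_2 = 0$). With this inner product, we obtain

$$\langle \boldsymbol{x}, \boldsymbol{x} \rangle = x_1^2 - x_1 x_2 + x_2^2 = 1 - 1 + 1 = 1 \implies \|\boldsymbol{x}\| = \sqrt{1} = 1, \tag{3.20}$$

such that $\boldsymbol{x}$ is “shorter” with this inner product than with the dot product.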
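The two lengths in Example 3.5 can be reproduced with a few lines of NumPy. This is a small sketch rather than a reference implementation; the helper name `induced_norm` is our own, and it assumes the inner product is given as $\langle \boldsymbol{x}, \boldsymbol{y} \rangle = \boldsymbol{x}^\top \boldsymbol{A} \boldsymbol{y}$ with a symmetric, positive definite $\boldsymbol{A}$.

```python
import numpy as np

def induced_norm(x, A):
    """Norm induced by the inner product <x, y> = x^T A y.

    A is assumed to be symmetric and positive definite.
    """
    return np.sqrt(x @ A @ x)

x = np.array([1.0, 1.0])
A_dot = np.eye(2)                    # identity matrix gives the dot product
A_alt = np.array([[1.0, -0.5],
                  [-0.5, 1.0]])      # matrix of the inner product in (3.19)

norm_dot = induced_norm(x, A_dot)    # length under the dot product
norm_alt = induced_norm(x, A_alt)    # length under the inner product (3.19)
print(norm_dot, norm_alt)
```

Passing the matrix as an argument keeps the norm and the choice of inner product separate, which mirrors how (3.16) is stated.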


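Definition 3.6 (Distance and Metric). Consider an inner product space
$(V, \langle \cdot, \cdot \rangle)$. Then

$$d(\boldsymbol{x}, \boldsymbol{y}) := \|\boldsymbol{x} - \boldsymbol{y}\| = \sqrt{\langle \boldsymbol{x} - \boldsymbol{y}, \boldsymbol{x} - \boldsymbol{y} \rangle} \tag{3.21}$$

is called the distance between $\boldsymbol{x}$ and $\boldsymbol{y}$ for $\boldsymbol{x}, \boldsymbol{y} \in V$. If we use the dot
product as the inner product, then the distance is called Euclidean distance.
The mapping

$$d : V \times V \to \mathbb{R} \tag{3.22}$$
$$(\boldsymbol{x}, \boldsymbol{y}) \mapsto d(\boldsymbol{x}, \boldsymbol{y}) \tag{3.23}$$

is called a metric.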


Remark. Similar to the length of a vector, the distance between vectors 
does not require an inner product: a norm is sufficient. If we have a norm 
induced by an inner product, the distance may vary depending on the 
choice of the inner product. ♢


A metric d satisfies the following: 


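1. $d$ is positive definite, i.e., $d(\boldsymbol{x}, \boldsymbol{y}) \geq 0$ for all $\boldsymbol{x}, \boldsymbol{y} \in V$ and
   $d(\boldsymbol{x}, \boldsymbol{y}) = 0 \iff \boldsymbol{x} = \boldsymbol{y}$.

2. $d$ is symmetric, i.e., $d(\boldsymbol{x}, \boldsymbol{y}) = d(\boldsymbol{y}, \boldsymbol{x})$ for all $\boldsymbol{x}, \boldsymbol{y} \in V$.

3. Triangle inequality: $d(\boldsymbol{x}, \boldsymbol{z}) \leq d(\boldsymbol{x}, \boldsymbol{y}) + d(\boldsymbol{y}, \boldsymbol{z})$ for all $\boldsymbol{x}, \boldsymbol{y}, \boldsymbol{z} \in V$.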


Remark. At first glance, the lists of properties of inner products and
metrics look very similar. However, by comparing Definition 3.3 with
Definition 3.6 we observe that $\langle \boldsymbol{x}, \boldsymbol{y} \rangle$ and $d(\boldsymbol{x}, \boldsymbol{y})$ behave in opposite directions.
Very similar $\boldsymbol{x}$ and $\boldsymbol{y}$ will result in a large value for the inner product and
a small value for the metric. ♢


3.4 Angles and Orthogonality 


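In addition to enabling the definition of lengths of vectors, as well as the
distance between two vectors, inner products also capture the geometry
of a vector space by defining the angle $\omega$ between two vectors. We use
the Cauchy-Schwarz inequality (3.17) to define angles $\omega$ in inner product
spaces between two vectors $\boldsymbol{x}, \boldsymbol{y}$, and this notion coincides with our
intuition in $\mathbb{R}^2$ and $\mathbb{R}^3$. Assume that $\boldsymbol{x} \neq \boldsymbol{0}$, $\boldsymbol{y} \neq \boldsymbol{0}$. Then

$$-1 \leq \frac{\langle \boldsymbol{x}, \boldsymbol{y} \rangle}{\|\boldsymbol{x}\| \, \|\boldsymbol{y}\|} \leq 1. \tag{3.24}$$

Therefore, there exists a unique $\omega \in [0, \pi]$, illustrated in Figure 3.4, with

$$\cos \omega = \frac{\langle \boldsymbol{x}, \boldsymbol{y} \rangle}{\|\boldsymbol{x}\| \, \|\boldsymbol{y}\|}. \tag{3.25}$$

The number $\omega$ is the angle between the vectors $\boldsymbol{x}$ and $\boldsymbol{y}$. Intuitively, the
angle between two vectors tells us how similar their orientations are. For
example, using the dot product, the angle between $\boldsymbol{x}$ and $\boldsymbol{y} = 4\boldsymbol{x}$, i.e., $\boldsymbol{y}$
is a scaled version of $\boldsymbol{x}$, is 0: Their orientation is the same.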
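Example 3.6 (Angle between Vectors)

Let us compute the angle between $\boldsymbol{x} = [1, 1]^\top \in \mathbb{R}^2$ and $\boldsymbol{y} = [1, 2]^\top \in \mathbb{R}^2$;
see Figure 3.5, where we use the dot product as the inner product. Then
we get

$$\cos \omega = \frac{\langle \boldsymbol{x}, \boldsymbol{y} \rangle}{\sqrt{\langle \boldsymbol{x}, \boldsymbol{x} \rangle \langle \boldsymbol{y}, \boldsymbol{y} \rangle}} = \frac{\boldsymbol{x}^\top \boldsymbol{y}}{\sqrt{\boldsymbol{x}^\top \boldsymbol{x} \, \boldsymbol{y}^\top \boldsymbol{y}}} = \frac{3}{\sqrt{10}}, \tag{3.26}$$

and the angle between the two vectors is $\arccos\left(\frac{3}{\sqrt{10}}\right) \approx 0.32$ rad, which
corresponds to about $18^\circ$.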
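The computation in Example 3.6 can be reproduced directly. The snippet below is a minimal sketch that assumes the dot product as the inner product; the variable names are illustrative.

```python
import numpy as np

x = np.array([1.0, 1.0])
y = np.array([1.0, 2.0])

# cos(omega) = x^T y / sqrt(x^T x * y^T y), see (3.26)
cos_omega = (x @ y) / np.sqrt((x @ x) * (y @ y))
omega = np.arccos(cos_omega)          # angle in radians

print(omega, np.degrees(omega))       # roughly 0.32 rad, about 18 degrees
```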


A key feature of the inner product is that it also allows us to characterize 
vectors that are orthogonal. 


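Definition 3.7 (Orthogonality). Two vectors $\boldsymbol{x}$ and $\boldsymbol{y}$ are orthogonal if and
only if $\langle \boldsymbol{x}, \boldsymbol{y} \rangle = 0$, and we write $\boldsymbol{x} \perp \boldsymbol{y}$. If additionally $\|\boldsymbol{x}\| = 1 = \|\boldsymbol{y}\|$,
i.e., the vectors are unit vectors, then $\boldsymbol{x}$ and $\boldsymbol{y}$ are orthonormal.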


An implication of this definition is that the $\boldsymbol{0}$-vector is orthogonal to
every vector in the vector space.


Remark. Orthogonality is the generalization of the concept of
perpendicularity to bilinear forms that do not have to be the dot product. In our
context, geometrically, we can think of orthogonal vectors as having a
right angle with respect to a specific inner product. ♢


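Example 3.7 (Orthogonal Vectors)

Consider two vectors $\boldsymbol{x} = [1, 1]^\top$, $\boldsymbol{y} = [-1, 1]^\top \in \mathbb{R}^2$; see Figure 3.6.
We are interested in determining the angle $\omega$ between them using two
different inner products. Using the dot product as the inner product yields
an angle $\omega$ between $\boldsymbol{x}$ and $\boldsymbol{y}$ of $90^\circ$, such that $\boldsymbol{x} \perp \boldsymbol{y}$. However, if we
choose the inner product

$$\langle \boldsymbol{x}, \boldsymbol{y} \rangle = \boldsymbol{x}^\top \begin{bmatrix} 2 & 0 \\ 0 & 1 \end{bmatrix} \boldsymbol{y}, \tag{3.27}$$

we get that the angle $\omega$ between $\boldsymbol{x}$ and $\boldsymbol{y}$ is given by

$$\cos \omega = \frac{\langle \boldsymbol{x}, \boldsymbol{y} \rangle}{\|\boldsymbol{x}\| \, \|\boldsymbol{y}\|} = -\frac{1}{3} \implies \omega \approx 1.91 \text{ rad} \approx 109.5^\circ, \tag{3.28}$$

and $\boldsymbol{x}$ and $\boldsymbol{y}$ are not orthogonal. Therefore, vectors that are orthogonal
with respect to one inner product do not have to be orthogonal with
respect to a different inner product.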
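To see how the angle in Example 3.7 depends on the inner product, the sketch below evaluates (3.25) for an inner product of the form $\boldsymbol{x}^\top \boldsymbol{A} \boldsymbol{y}$. The function name `angle` and its signature are our own choices, and $\boldsymbol{A}$ is assumed to be symmetric and positive definite.

```python
import numpy as np

def angle(x, y, A):
    """Angle between x and y under the inner product <x, y> = x^T A y."""
    inner = lambda u, v: u @ A @ v
    cos_omega = inner(x, y) / np.sqrt(inner(x, x) * inner(y, y))
    # Clip guards against round-off pushing the value slightly outside [-1, 1].
    return np.arccos(np.clip(cos_omega, -1.0, 1.0))

x = np.array([1.0, 1.0])
y = np.array([-1.0, 1.0])

omega_dot = angle(x, y, np.eye(2))              # dot product
omega_alt = angle(x, y, np.diag([2.0, 1.0]))    # inner product (3.27)

print(np.degrees(omega_dot), np.degrees(omega_alt))
```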


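Definition 3.8 (Orthogonal Matrix). A square matrix $\boldsymbol{A} \in \mathbb{R}^{n \times n}$ is an
orthogonal matrix if and only if its columns are orthonormal so that

$$\boldsymbol{A}\boldsymbol{A}^\top = \boldsymbol{I} = \boldsymbol{A}^\top \boldsymbol{A}, \tag{3.29}$$

which implies that

$$\boldsymbol{A}^{-1} = \boldsymbol{A}^\top, \tag{3.30}$$

i.e., the inverse is obtained by simply transposing the matrix.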


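Transformations by orthogonal matrices are special because the length
of a vector $\boldsymbol{x}$ is not changed when transforming it using an orthogonal
matrix $\boldsymbol{A}$. For the dot product, we obtain

$$\|\boldsymbol{A}\boldsymbol{x}\|^2 = (\boldsymbol{A}\boldsymbol{x})^\top (\boldsymbol{A}\boldsymbol{x}) = \boldsymbol{x}^\top \boldsymbol{A}^\top \boldsymbol{A} \boldsymbol{x} = \boldsymbol{x}^\top \boldsymbol{I} \boldsymbol{x} = \boldsymbol{x}^\top \boldsymbol{x} = \|\boldsymbol{x}\|^2. \tag{3.31}$$

Moreover, the angle between any two vectors $\boldsymbol{x}, \boldsymbol{y}$, as measured by their
inner product, is also unchanged when transforming both of them using
an orthogonal matrix $\boldsymbol{A}$. Assuming the dot product as the inner product,
the angle of the images $\boldsymbol{A}\boldsymbol{x}$ and $\boldsymbol{A}\boldsymbol{y}$ is given as

$$\cos \omega = \frac{(\boldsymbol{A}\boldsymbol{x})^\top (\boldsymbol{A}\boldsymbol{y})}{\|\boldsymbol{A}\boldsymbol{x}\| \, \|\boldsymbol{A}\boldsymbol{y}\|} = \frac{\boldsymbol{x}^\top \boldsymbol{A}^\top \boldsymbol{A} \boldsymbol{y}}{\sqrt{\boldsymbol{x}^\top \boldsymbol{A}^\top \boldsymbol{A} \boldsymbol{x} \, \boldsymbol{y}^\top \boldsymbol{A}^\top \boldsymbol{A} \boldsymbol{y}}} = \frac{\boldsymbol{x}^\top \boldsymbol{y}}{\|\boldsymbol{x}\| \, \|\boldsymbol{y}\|}, \tag{3.32}$$

which gives exactly the angle between $\boldsymbol{x}$ and $\boldsymbol{y}$. This means that
orthogonal matrices $\boldsymbol{A}$ with $\boldsymbol{A}^\top = \boldsymbol{A}^{-1}$ preserve both angles and distances. It
turns out that orthogonal matrices define transformations that are
rotations (with the possibility of flips). In Section 3.9, we will discuss more
details about rotations.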
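The preservation of lengths and inner products in (3.31) and (3.32) can be checked numerically. The sketch below builds an orthogonal matrix from the QR decomposition of a random matrix, which is simply a convenient way to obtain one for illustration; the seed and dimensions are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# The Q factor of a QR decomposition has orthonormal columns.
Q, _ = np.linalg.qr(rng.standard_normal((3, 3)))
x = rng.standard_normal(3)
y = rng.standard_normal(3)

is_orthogonal = np.allclose(Q.T @ Q, np.eye(3))                           # (3.29)
length_preserved = np.isclose(np.linalg.norm(Q @ x), np.linalg.norm(x))   # (3.31)
inner_preserved = np.isclose((Q @ x) @ (Q @ y), x @ y)                    # numerator of (3.32)

print(is_orthogonal, length_preserved, inner_preserved)
```

Since both the inner product and the norms are unchanged, the cosine in (3.32), and hence the angle, is unchanged as well.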





3.5 Orthonormal Basis 


In Section 2.6.1, we characterized properties of basis vectors and found 
that in an n-dimensional vector space, we need n basis vectors, i.e., n 
vectors that are linearly independent. In Sections 3.3 and 3.4, we used 
inner products to compute the length of vectors and the angle between 
vectors. In the following, we will discuss the special case where the basis 
vectors are orthogonal to each other and where the length of each basis 
vector is 1. We will call this basis then an orthonormal basis. 
Let us introduce this more formally.

Definition 3.9 (Orthonormal Basis). Consider an $n$-dimensional vector
space $V$ and a basis $\{\boldsymbol{b}_1, \ldots, \boldsymbol{b}_n\}$ of $V$. If

$$\langle \boldsymbol{b}_i, \boldsymbol{b}_j \rangle = 0 \quad \text{for } i \neq j \tag{3.33}$$
$$\langle \boldsymbol{b}_i, \boldsymbol{b}_i \rangle = 1 \tag{3.34}$$

for all $i, j = 1, \ldots, n$, then the basis is called an orthonormal basis (ONB).
If only (3.33) is satisfied, then the basis is called an orthogonal basis. Note
that (3.34) implies that every basis vector has length/norm 1.


Recall from Section 2.6.1 that we can use Gaussian elimination to find a
basis for a vector space spanned by a set of vectors. Assume we are given
a set $\{\tilde{\boldsymbol{b}}_1, \ldots, \tilde{\boldsymbol{b}}_n\}$ of non-orthogonal and unnormalized basis vectors. We
concatenate them into a matrix $\tilde{\boldsymbol{B}} = [\tilde{\boldsymbol{b}}_1, \ldots, \tilde{\boldsymbol{b}}_n]$ and apply Gaussian
elimination to the augmented matrix (Section 2.3.2) $[\tilde{\boldsymbol{B}}\tilde{\boldsymbol{B}}^\top \mid \tilde{\boldsymbol{B}}]$ to obtain an
orthonormal basis. This constructive way to iteratively build an
orthonormal basis $\{\boldsymbol{b}_1, \ldots, \boldsymbol{b}_n\}$ is called the Gram-Schmidt process (Strang, 2003).


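Example 3.8 (Orthonormal Basis)

The canonical/standard basis for a Euclidean vector space $\mathbb{R}^n$ is an
orthonormal basis, where the inner product is the dot product of vectors.
In $\mathbb{R}^2$, the vectors

$$\boldsymbol{b}_1 = \frac{1}{\sqrt{2}} \begin{bmatrix} 1 \\ 1 \end{bmatrix}, \quad \boldsymbol{b}_2 = \frac{1}{\sqrt{2}} \begin{bmatrix} 1 \\ -1 \end{bmatrix} \tag{3.35}$$

form an orthonormal basis since $\boldsymbol{b}_1^\top \boldsymbol{b}_2 = 0$ and $\|\boldsymbol{b}_1\| = 1 = \|\boldsymbol{b}_2\|$.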
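Conditions (3.33) and (3.34) can be checked in one step: if the basis vectors are stacked as the columns of a matrix $\boldsymbol{B}$, the basis is orthonormal with respect to the dot product exactly when $\boldsymbol{B}^\top \boldsymbol{B} = \boldsymbol{I}$. The following sketch applies this to the vectors in (3.35); the names `B`, `gram`, and `is_onb` are illustrative.

```python
import numpy as np

# Columns are b_1 and b_2 from (3.35).
B = np.array([[1.0, 1.0],
              [1.0, -1.0]]) / np.sqrt(2.0)

# Entry (i, j) of the Gram matrix is the dot product b_i^T b_j.
gram = B.T @ B
is_onb = np.allclose(gram, np.eye(2))

print(gram)
print(is_onb)
```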


We will exploit the concept of an orthonormal basis in Chapter 12 and 
Chapter 10 when we discuss support vector machines and principal com- 
ponent analysis. 


3.6 Orthogonal Complement 


Having defined orthogonality, we will now look at vector spaces that are 
orthogonal to each other. This will play an important role in Chapter 10, 
when we discuss linear dimensionality reduction from a geometric per- 
spective. 

Consider a $D$-dimensional vector space $V$ and an $M$-dimensional
subspace $U \subseteq V$. Then its orthogonal complement $U^\perp$ is a $(D - M)$-dimensional
subspace of $V$ and contains all vectors in $V$ that are orthogonal to every
vector in $U$. Furthermore, $U \cap U^\perp = \{\boldsymbol{0}\}$ so that any vector $\boldsymbol{x} \in V$ can be
uniquely decomposed into

$$\boldsymbol{x} = \sum_{m=1}^{M} \lambda_m \boldsymbol{b}_m + \sum_{j=1}^{D-M} \psi_j \boldsymbol{b}_j^\perp, \quad \lambda_m, \psi_j \in \mathbb{R}, \tag{3.36}$$

where $(\boldsymbol{b}_1, \ldots, \boldsymbol{b}_M)$ is a basis of $U$ and $(\boldsymbol{b}_1^\perp, \ldots, \boldsymbol{b}_{D-M}^\perp)$ is a basis of $U^\perp$.

Figure 3.7 A plane $U$ in a three-dimensional vector space can be
described by its normal vector, which spans its orthogonal complement $U^\perp$.


Therefore, the orthogonal complement can also be used to describe a
plane $U$ (two-dimensional subspace) in a three-dimensional vector space.
More specifically, the vector $\boldsymbol{w}$ with $\|\boldsymbol{w}\| = 1$, which is orthogonal to the
plane $U$, is the basis vector of $U^\perp$. Figure 3.7 illustrates this setting. All
vectors that are orthogonal to $\boldsymbol{w}$ must (by construction) lie in the plane
$U$. The vector $\boldsymbol{w}$ is called the normal vector of $U$.

Generally, orthogonal complements can be used to describe hyperplanes
in $n$-dimensional vector and affine spaces.
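As an illustration of the decomposition (3.36), the sketch below computes a unit normal vector of a plane in $\mathbb{R}^3$ and splits a vector into its components in $U$ and $U^\perp$. The plane, the vector `x`, and the use of the singular value decomposition to obtain a basis of $U^\perp$ are assumptions made for this example (with the dot product as the inner product), not something prescribed by the text.

```python
import numpy as np

# Columns span a plane U in R^3 (hypothetical example values).
B = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [1.0, 1.0]])

# With full_matrices=True, the left singular vectors beyond rank(B)
# form an orthonormal basis of the orthogonal complement of span(B).
U_svd, s, _ = np.linalg.svd(B, full_matrices=True)
rank = int(np.sum(s > 1e-12))
w = U_svd[:, rank]                 # unit normal vector of the plane

x = np.array([1.0, 2.0, 0.0])
x_perp = (w @ x) * w               # component in the orthogonal complement
x_U = x - x_perp                   # component in the plane U

print(w, x_U, x_perp)
```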


3.7 Inner Product of Functions 


Thus far, we looked at properties of inner products to compute lengths, 
angles and distances. We focused on inner products of finite-dimensional 
vectors. In the following, we will look at an example of inner products of 
a different type of vectors: inner products of functions. 

The inner products we discussed so far were defined for vectors with a
finite number of entries. We can think of a vector $\boldsymbol{x} \in \mathbb{R}^n$ as a function
with $n$ function values. The concept of an inner product can be generalized
to vectors with an infinite number of entries (countably infinite) and also
continuous-valued functions (uncountably infinite). Then the sum over
individual components of vectors (see Equation (3.5) for example) turns
into an integral.

An inner product of two functions $u : \mathbb{R} \to \mathbb{R}$ and $v : \mathbb{R} \to \mathbb{R}$ can be
defined as the definite integral

$$\langle u, v \rangle := \int_a^b u(x) v(x) \, dx \tag{3.37}$$

for lower and upper limits $a, b < \infty$, respectively. As with our usual inner
product, we can define norms and orthogonality by looking at the inner 
product. If (3.37) evaluates to 0, the functions u and v are orthogonal. To 
make the preceding inner product mathematically precise, we need to take 
care of measures and the definition of integrals, leading to the definition of 
a Hilbert space. Furthermore, unlike inner products on finite-dimensional 
vectors, inner products on functions may diverge (have infinite value). All 
this requires diving into some more intricate details of real and functional 
analysis, which we do not cover in this book. 


Example 3.9 (Inner Product of Functions) 

If we choose $u = \sin(x)$ and $v = \cos(x)$, the integrand $f(x) = u(x)v(x)$
of (3.37) is shown in Figure 3.8. We see that this function is odd, i.e.,
$f(-x) = -f(x)$. Therefore, the integral with limits $a = -\pi$, $b = \pi$ of this
product evaluates to 0. Therefore, $\sin$ and $\cos$ are orthogonal functions.
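The integral in (3.37) can be approximated numerically. The sketch below uses a hand-written trapezoidal rule on a uniform grid; the helper name `inner_product` and the number of grid points are our own choices, and the result is an approximation rather than an exact value.

```python
import numpy as np

def inner_product(u, v, a, b, num_points=10_001):
    """Approximate <u, v> = integral_a^b u(x) v(x) dx (trapezoidal rule)."""
    x = np.linspace(a, b, num_points)
    f = u(x) * v(x)
    dx = (b - a) / (num_points - 1)
    # Trapezoidal rule: interior points have weight 1, end points weight 1/2.
    return dx * (f.sum() - 0.5 * (f[0] + f[-1]))

ip_sin_cos = inner_product(np.sin, np.cos, -np.pi, np.pi)   # close to 0
ip_sin_sin = inner_product(np.sin, np.sin, -np.pi, np.pi)   # close to pi

print(ip_sin_cos, ip_sin_sin)
```

The second value is the squared norm of $\sin$ under this inner product, which shows that $\sin$ and $\cos$ are orthogonal on $[-\pi, \pi]$ but not normalized.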


Remark. It also holds that the collection of functions

$$\{1, \cos(x), \cos(2x), \cos(3x), \ldots\} \tag{3.38}$$

is orthogonal if we integrate from $-\pi$ to $\pi$, i.e., any pair of functions are
orthogonal to each other. The collection of functions in (3.38) spans a
large subspace of the functions that are even and periodic on $[-\pi, \pi)$, and
projecting functions onto this subspace is the fundamental idea behind
Fourier series. ♢


In Section 6.4.6, we will have a look at a second type of unconventional 
inner products: the inner product of random variables. 


3.8 Orthogonal Projections 


Projections are an important class of linear transformations (besides rota- 
tions and reflections) and play an important role in graphics, coding the- 
ory, statistics and machine learning. In machine learning, we often deal 
with data that is high-dimensional. High-dimensional data is often hard 
to analyze or visualize. However, high-dimensional data quite often pos- 
sesses the property that only a few dimensions contain most information, 
and most other dimensions are not essential to describe key properties 
of the data. When we compress or visualize high-dimensional data, we 
will lose information. To minimize this compression loss, we ideally find 
the most informative dimensions in the data. As discussed in Chapter 1, 
data can be represented as vectors, and in this chapter, we will discuss 
some of the fundamental tools for data compression. More specifically, we 
can project the original high-dimensional data onto a lower-dimensional 
feature space and work in this lower-dimensional space to learn more 
about the dataset and extract relevant patterns. For example, machine
learning algorithms, such as principal component analysis (PCA) by
Pearson (1901) and Hotelling (1933) and deep neural networks (e.g., deep
auto-encoders (Deng et al., 2010)), heavily exploit the idea of
dimensionality reduction. In the following, we will focus on orthogonal projections,
which we will use in Chapter 10 for linear dimensionality reduction and
in Chapter 12 for classification. Even linear regression, which we discuss
in Chapter 9, can be interpreted using orthogonal projections. For a given
lower-dimensional subspace, orthogonal projections of high-dimensional
data retain as much information as possible and minimize the
difference/error between the original data and the corresponding projection. An
illustration of such an orthogonal projection is given in Figure 3.9. Before
we detail how to obtain these projections, let us define what a projection
actually is. (“Feature” is a common expression for data representation.)

Figure 3.8 $f(x) = \sin(x)\cos(x)$.

Figure 3.9 Orthogonal projection (orange dots) of a two-dimensional
dataset (blue dots) onto a one-dimensional subspace (straight line).


Definition 3.10 (Projection). Let $V$ be a vector space and $U \subseteq V$ a
subspace of $V$. A linear mapping $\pi : V \to U$ is called a projection if
$\pi^2 = \pi \circ \pi = \pi$.

Since linear mappings can be expressed by transformation matrices (see
Section 2.7), the preceding definition applies equally to a special kind
of transformation matrices, the projection matrices $\boldsymbol{P}_\pi$, which exhibit the
property that $\boldsymbol{P}_\pi^2 = \boldsymbol{P}_\pi$.

In the following, we will derive orthogonal projections of vectors in the
inner product space $(\mathbb{R}^n, \langle \cdot, \cdot \rangle)$ onto subspaces. We will start with
one-dimensional subspaces, which are also called lines. If not mentioned
otherwise, we assume the dot product $\langle \boldsymbol{x}, \boldsymbol{y} \rangle = \boldsymbol{x}^\top \boldsymbol{y}$ as the inner product.


3.8.1 Projection onto One-Dimensional Subspaces (Lines) 


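Assume we are given a line (one-dimensional subspace) through the
origin with basis vector $\boldsymbol{b} \in \mathbb{R}^n$. The line is a one-dimensional subspace
$U \subseteq \mathbb{R}^n$ spanned by $\boldsymbol{b}$. When we project $\boldsymbol{x} \in \mathbb{R}^n$ onto $U$, we seek the
vector $\pi_U(\boldsymbol{x}) \in U$ that is closest to $\boldsymbol{x}$. Using geometric arguments, let
us characterize some properties of the projection $\pi_U(\boldsymbol{x})$ (Figure 3.10(a)
serves as an illustration):

• The projection $\pi_U(\boldsymbol{x})$ is closest to $\boldsymbol{x}$, where “closest” implies that the
  distance $\|\boldsymbol{x} - \pi_U(\boldsymbol{x})\|$ is minimal. It follows that the segment $\pi_U(\boldsymbol{x}) - \boldsymbol{x}$
  from $\pi_U(\boldsymbol{x})$ to $\boldsymbol{x}$ is orthogonal to $U$, and therefore the basis vector $\boldsymbol{b}$ of
  $U$. The orthogonality condition yields $\langle \pi_U(\boldsymbol{x}) - \boldsymbol{x}, \boldsymbol{b} \rangle = 0$ since angles
  between vectors are defined via the inner product.

• The projection $\pi_U(\boldsymbol{x})$ of $\boldsymbol{x}$ onto $U$ must be an element of $U$ and,
  therefore, a multiple of the basis vector $\boldsymbol{b}$ that spans $U$. Hence, $\pi_U(\boldsymbol{x}) = \lambda \boldsymbol{b}$,
  for some $\lambda \in \mathbb{R}$. ($\lambda$ is then the coordinate of $\pi_U(\boldsymbol{x})$ with respect to $\boldsymbol{b}$.)

Figure 3.10 Examples of projections onto one-dimensional subspaces.
(a) Projection of $\boldsymbol{x} \in \mathbb{R}^2$ onto a subspace $U$ with basis vector $\boldsymbol{b}$.
(b) Projection of a two-dimensional vector $\boldsymbol{x}$ with $\|\boldsymbol{x}\| = 1$ onto a
one-dimensional subspace spanned by $\boldsymbol{b}$.

In the following three steps, we determine the coordinate $\lambda$, the projection
$\pi_U(\boldsymbol{x}) \in U$, and the projection matrix $\boldsymbol{P}_\pi$ that maps any $\boldsymbol{x} \in \mathbb{R}^n$ onto $U$:

1. Finding the coordinate $\lambda$. The orthogonality condition yields

   $$\langle \boldsymbol{x} - \pi_U(\boldsymbol{x}), \boldsymbol{b} \rangle = 0 \overset{\pi_U(\boldsymbol{x}) = \lambda \boldsymbol{b}}{\iff} \langle \boldsymbol{x} - \lambda \boldsymbol{b}, \boldsymbol{b} \rangle = 0. \tag{3.39}$$

   We can now exploit the bilinearity of the inner product and arrive at

   $$\langle \boldsymbol{x}, \boldsymbol{b} \rangle - \lambda \langle \boldsymbol{b}, \boldsymbol{b} \rangle = 0 \iff \lambda = \frac{\langle \boldsymbol{x}, \boldsymbol{b} \rangle}{\langle \boldsymbol{b}, \boldsymbol{b} \rangle} = \frac{\langle \boldsymbol{b}, \boldsymbol{x} \rangle}{\|\boldsymbol{b}\|^2}. \tag{3.40}$$

   In the last step, we exploited the fact that inner products are
   symmetric. If we choose $\langle \cdot, \cdot \rangle$ to be the dot product, we obtain

   $$\lambda = \frac{\boldsymbol{b}^\top \boldsymbol{x}}{\boldsymbol{b}^\top \boldsymbol{b}} = \frac{\boldsymbol{b}^\top \boldsymbol{x}}{\|\boldsymbol{b}\|^2}. \tag{3.41}$$

   If $\|\boldsymbol{b}\| = 1$, then the coordinate $\lambda$ of the projection is given by $\boldsymbol{b}^\top \boldsymbol{x}$.
   (With a general inner product, we get $\lambda = \langle \boldsymbol{x}, \boldsymbol{b} \rangle$ if $\|\boldsymbol{b}\| = 1$.)

2. Finding the projection point $\pi_U(\boldsymbol{x}) \in U$. Since $\pi_U(\boldsymbol{x}) = \lambda \boldsymbol{b}$, we
   immediately obtain with (3.40) that

   $$\pi_U(\boldsymbol{x}) = \lambda \boldsymbol{b} = \frac{\langle \boldsymbol{x}, \boldsymbol{b} \rangle}{\|\boldsymbol{b}\|^2} \boldsymbol{b} = \frac{\boldsymbol{b}^\top \boldsymbol{x}}{\|\boldsymbol{b}\|^2} \boldsymbol{b}, \tag{3.42}$$

   where the last equality holds for the dot product only. We can also
   compute the length of $\pi_U(\boldsymbol{x})$ by means of Definition 3.1 as

   $$\|\pi_U(\boldsymbol{x})\| = \|\lambda \boldsymbol{b}\| = |\lambda| \, \|\boldsymbol{b}\|. \tag{3.43}$$

   Hence, our projection is of length $|\lambda|$ times the length of $\boldsymbol{b}$. This also
   adds the intuition that $\lambda$ is the coordinate of $\pi_U(\boldsymbol{x})$ with respect to the
   basis vector $\boldsymbol{b}$ that spans our one-dimensional subspace $U$.

   If we use the dot product as an inner product, we get

   $$\|\pi_U(\boldsymbol{x})\| \overset{(3.42)}{=} \frac{|\boldsymbol{b}^\top \boldsymbol{x}|}{\|\boldsymbol{b}\|^2} \|\boldsymbol{b}\| \overset{(3.25)}{=} |\cos \omega| \, \|\boldsymbol{x}\| \, \|\boldsymbol{b}\| \frac{\|\boldsymbol{b}\|}{\|\boldsymbol{b}\|^2} = |\cos \omega| \, \|\boldsymbol{x}\|. \tag{3.44}$$

   Here, $\omega$ is the angle between $\boldsymbol{x}$ and $\boldsymbol{b}$. This equation should be familiar
   from trigonometry: If $\|\boldsymbol{x}\| = 1$, then $\boldsymbol{x}$ lies on the unit circle. It follows
   that the projection onto the horizontal axis spanned by $\boldsymbol{b}$ is exactly
   $\cos \omega$, and the length of the corresponding vector is $\|\pi_U(\boldsymbol{x})\| = |\cos \omega|$.
   An illustration is given in Figure 3.10(b). (The horizontal axis is a
   one-dimensional subspace.)

3. Finding the projection matrix $\boldsymbol{P}_\pi$. We know that a projection is a
   linear mapping (see Definition 3.10). Therefore, there exists a projection
   matrix $\boldsymbol{P}_\pi$ such that $\pi_U(\boldsymbol{x}) = \boldsymbol{P}_\pi \boldsymbol{x}$. With the dot product as inner
   product and

   $$\pi_U(\boldsymbol{x}) = \lambda \boldsymbol{b} = \boldsymbol{b} \lambda = \boldsymbol{b} \frac{\boldsymbol{b}^\top \boldsymbol{x}}{\|\boldsymbol{b}\|^2} = \frac{\boldsymbol{b}\boldsymbol{b}^\top}{\|\boldsymbol{b}\|^2} \boldsymbol{x}, \tag{3.45}$$

   we immediately see that

   $$\boldsymbol{P}_\pi = \frac{\boldsymbol{b}\boldsymbol{b}^\top}{\|\boldsymbol{b}\|^2}. \tag{3.46}$$

   Note that $\boldsymbol{b}\boldsymbol{b}^\top$ (and, consequently, $\boldsymbol{P}_\pi$) is a symmetric matrix (of rank
   1), and $\|\boldsymbol{b}\|^2 = \langle \boldsymbol{b}, \boldsymbol{b} \rangle$ is a scalar. (Orthogonal projection matrices
   are always symmetric.)

The projection matrix $\boldsymbol{P}_\pi$ projects any vector $\boldsymbol{x} \in \mathbb{R}^n$ onto the line through
the origin with direction $\boldsymbol{b}$ (equivalently, the subspace $U$ spanned by $\boldsymbol{b}$).

Remark. The projection $\pi_U(\boldsymbol{x}) \in \mathbb{R}^n$ is still an $n$-dimensional vector and
not a scalar. However, we no longer require $n$ coordinates to represent the
projection, but only a single one if we want to express it with respect to
the basis vector $\boldsymbol{b}$ that spans the subspace $U$: $\lambda$. ♢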

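Example 3.10 (Projection onto a Line)

Find the projection matrix $\boldsymbol{P}_\pi$ onto the line through the origin spanned
by $\boldsymbol{b} = [1, 2, 2]^\top$. Here, $\boldsymbol{b}$ is a direction and a basis of the one-dimensional
subspace (line through origin).

With (3.46), we obtain

$$\boldsymbol{P}_\pi = \frac{\boldsymbol{b}\boldsymbol{b}^\top}{\boldsymbol{b}^\top \boldsymbol{b}} = \frac{1}{9} \begin{bmatrix} 1 \\ 2 \\ 2 \end{bmatrix} \begin{bmatrix} 1 & 2 & 2 \end{bmatrix} = \frac{1}{9} \begin{bmatrix} 1 & 2 & 2 \\ 2 & 4 & 4 \\ 2 & 4 & 4 \end{bmatrix}. \tag{3.47}$$

Let us now choose a particular $\boldsymbol{x}$ and see whether it lies in the subspace
spanned by $\boldsymbol{b}$. For $\boldsymbol{x} = [1, 1, 1]^\top$, the projection is

$$\pi_U(\boldsymbol{x}) = \boldsymbol{P}_\pi \boldsymbol{x} = \frac{1}{9} \begin{bmatrix} 1 & 2 & 2 \\ 2 & 4 & 4 \\ 2 & 4 & 4 \end{bmatrix} \begin{bmatrix} 1 \\ 1 \\ 1 \end{bmatrix} = \frac{1}{9} \begin{bmatrix} 5 \\ 10 \\ 10 \end{bmatrix} \in \operatorname{span}\left[ \begin{bmatrix} 1 \\ 2 \\ 2 \end{bmatrix} \right]. \tag{3.48}$$

Note that the application of $\boldsymbol{P}_\pi$ to $\pi_U(\boldsymbol{x})$ does not change anything, i.e.,
$\boldsymbol{P}_\pi \pi_U(\boldsymbol{x}) = \pi_U(\boldsymbol{x})$. This is expected because according to Definition 3.10,
we know that a projection matrix $\boldsymbol{P}_\pi$ satisfies $\boldsymbol{P}_\pi^2 \boldsymbol{x} = \boldsymbol{P}_\pi \boldsymbol{x}$ for all $\boldsymbol{x}$.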
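The three steps for projecting onto a line, together with the numbers of Example 3.10, can be reproduced as follows. This is a minimal sketch assuming the dot product as the inner product; the function name `projection_matrix_line` is our own.

```python
import numpy as np

def projection_matrix_line(b):
    """Projection matrix b b^T / (b^T b) onto the line spanned by b, see (3.46)."""
    b = np.asarray(b, dtype=float)
    return np.outer(b, b) / (b @ b)

b = np.array([1.0, 2.0, 2.0])
x = np.array([1.0, 1.0, 1.0])

lam = (b @ x) / (b @ b)            # coordinate, see (3.41)
P = projection_matrix_line(b)      # projection matrix, see (3.47)
proj = P @ x                       # projection point, see (3.48)

print(lam)
print(proj * 9)                    # compare with [5, 10, 10] in (3.48)
```

Applying `P` a second time leaves `proj` unchanged, and the residual `x - proj` is orthogonal to `b`, which are the two defining properties used in the derivation.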


Remark. With the results from Chapter 4, we can show that $\pi_U(\boldsymbol{x})$ is an
eigenvector of $\boldsymbol{P}_\pi$, and the corresponding eigenvalue is 1. ♢


3.8.2 Projection onto General Subspaces 


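In the following, we look at orthogonal projections of vectors $\boldsymbol{x} \in \mathbb{R}^n$
onto lower-dimensional subspaces $U \subseteq \mathbb{R}^n$ with $\dim(U) = m \geq 1$. An
illustration is given in Figure 3.11.

Figure 3.11 Projection onto a two-dimensional subspace $U$ with basis
$\boldsymbol{b}_1, \boldsymbol{b}_2$. The projection $\pi_U(\boldsymbol{x})$ of $\boldsymbol{x} \in \mathbb{R}^3$ onto $U$ can be expressed as a
linear combination of $\boldsymbol{b}_1, \boldsymbol{b}_2$ and the displacement vector $\boldsymbol{x} - \pi_U(\boldsymbol{x})$ is
orthogonal to both $\boldsymbol{b}_1$ and $\boldsymbol{b}_2$.

Assume that $(\boldsymbol{b}_1, \ldots, \boldsymbol{b}_m)$ is an ordered basis of $U$. (If $U$ is given by a set
of spanning vectors, which are not a basis, make sure you determine a
basis $\boldsymbol{b}_1, \ldots, \boldsymbol{b}_m$ before proceeding.) Any projection $\pi_U(\boldsymbol{x})$
onto $U$ is necessarily an element of $U$. Therefore, they can be represented
as linear combinations of the basis vectors $\boldsymbol{b}_1, \ldots, \boldsymbol{b}_m$ of $U$, such that
$\pi_U(\boldsymbol{x}) = \sum_{i=1}^{m} \lambda_i \boldsymbol{b}_i$. (The basis vectors form the columns of
$\boldsymbol{B} \in \mathbb{R}^{n \times m}$, where $\boldsymbol{B} = [\boldsymbol{b}_1, \ldots, \boldsymbol{b}_m]$.)

As in the 1D case, we follow a three-step procedure to find the
projection $\pi_U(\boldsymbol{x})$ and the projection matrix $\boldsymbol{P}_\pi$:

1. Find the coordinates $\lambda_1, \ldots, \lambda_m$ of the projection (with respect to the
   basis of $U$), such that the linear combination

   $$\pi_U(\boldsymbol{x}) = \sum_{i=1}^{m} \lambda_i \boldsymbol{b}_i = \boldsymbol{B}\boldsymbol{\lambda}, \tag{3.49}$$
   $$\boldsymbol{B} = [\boldsymbol{b}_1, \ldots, \boldsymbol{b}_m] \in \mathbb{R}^{n \times m}, \quad \boldsymbol{\lambda} = [\lambda_1, \ldots, \lambda_m]^\top \in \mathbb{R}^m, \tag{3.50}$$

   is closest to $\boldsymbol{x} \in \mathbb{R}^n$. As in the 1D case, “closest” means “minimum
   distance”, which implies that the vector connecting $\pi_U(\boldsymbol{x}) \in U$ and
   $\boldsymbol{x} \in \mathbb{R}^n$ must be orthogonal to all basis vectors of $U$. Therefore, we
   obtain $m$ simultaneous conditions (assuming the dot product as the
   inner product)

   $$\langle \boldsymbol{b}_1, \boldsymbol{x} - \pi_U(\boldsymbol{x}) \rangle = \boldsymbol{b}_1^\top (\boldsymbol{x} - \pi_U(\boldsymbol{x})) = 0 \tag{3.51}$$
   $$\vdots$$
   $$\langle \boldsymbol{b}_m, \boldsymbol{x} - \pi_U(\boldsymbol{x}) \rangle = \boldsymbol{b}_m^\top (\boldsymbol{x} - \pi_U(\boldsymbol{x})) = 0 \tag{3.52}$$

   which, with $\pi_U(\boldsymbol{x}) = \boldsymbol{B}\boldsymbol{\lambda}$, can be written as

   $$\boldsymbol{b}_1^\top (\boldsymbol{x} - \boldsymbol{B}\boldsymbol{\lambda}) = 0 \tag{3.53}$$
   $$\vdots$$
   $$\boldsymbol{b}_m^\top (\boldsymbol{x} - \boldsymbol{B}\boldsymbol{\lambda}) = 0 \tag{3.54}$$

   such that we obtain a homogeneous linear equation system

   $$\begin{bmatrix} \boldsymbol{b}_1^\top \\ \vdots \\ \boldsymbol{b}_m^\top \end{bmatrix} \begin{bmatrix} \boldsymbol{x} - \boldsymbol{B}\boldsymbol{\lambda} \end{bmatrix} = \boldsymbol{0} \iff \boldsymbol{B}^\top (\boldsymbol{x} - \boldsymbol{B}\boldsymbol{\lambda}) = \boldsymbol{0} \tag{3.55}$$
   $$\iff \boldsymbol{B}^\top \boldsymbol{B} \boldsymbol{\lambda} = \boldsymbol{B}^\top \boldsymbol{x}. \tag{3.56}$$

   The last expression is called normal equation. Since $\boldsymbol{b}_1, \ldots, \boldsymbol{b}_m$ are a
   basis of $U$ and, therefore, linearly independent, $\boldsymbol{B}^\top \boldsymbol{B} \in \mathbb{R}^{m \times m}$ is
   regular and can be inverted. This allows us to solve for the
   coefficients/coordinates

   $$\boldsymbol{\lambda} = (\boldsymbol{B}^\top \boldsymbol{B})^{-1} \boldsymbol{B}^\top \boldsymbol{x}. \tag{3.57}$$

   The matrix $(\boldsymbol{B}^\top \boldsymbol{B})^{-1} \boldsymbol{B}^\top$ is also called the pseudo-inverse of $\boldsymbol{B}$, which
   can be computed for non-square matrices $\boldsymbol{B}$. It only requires that $\boldsymbol{B}^\top \boldsymbol{B}$
   is positive definite, which is the case if $\boldsymbol{B}$ is full rank. In practical
   applications (e.g., linear regression), we often add a “jitter term” $\epsilon \boldsymbol{I}$ to
   $\boldsymbol{B}^\top \boldsymbol{B}$ to guarantee increased numerical stability and positive
   definiteness. This “ridge” can be rigorously derived using Bayesian inference.
   See Chapter 9 for details.

2. Find the projection $\pi_U(\boldsymbol{x}) \in U$. We already established that
   $\pi_U(\boldsymbol{x}) = \boldsymbol{B}\boldsymbol{\lambda}$. Therefore, with (3.57)

   $$\pi_U(\boldsymbol{x}) = \boldsymbol{B} (\boldsymbol{B}^\top \boldsymbol{B})^{-1} \boldsymbol{B}^\top \boldsymbol{x}. \tag{3.58}$$

3. Find the projection matrix $\boldsymbol{P}_\pi$. From (3.58), we can immediately see
   that the projection matrix that solves $\boldsymbol{P}_\pi \boldsymbol{x} = \pi_U(\boldsymbol{x})$ must be

   $$\boldsymbol{P}_\pi = \boldsymbol{B} (\boldsymbol{B}^\top \boldsymbol{B})^{-1} \boldsymbol{B}^\top. \tag{3.59}$$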


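Remark. The solution for projecting onto general subspaces includes the
1D case as a special case: If $\dim(U) = 1$, then $\boldsymbol{B}^\top \boldsymbol{B} \in \mathbb{R}$ is a scalar and
we can rewrite the projection matrix in (3.59) $\boldsymbol{P}_\pi = \boldsymbol{B} (\boldsymbol{B}^\top \boldsymbol{B})^{-1} \boldsymbol{B}^\top$ as
$\boldsymbol{P}_\pi = \frac{\boldsymbol{B}\boldsymbol{B}^\top}{\boldsymbol{B}^\top \boldsymbol{B}}$, which is exactly the projection matrix in (3.46). ♢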





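Example 3.11 (Projection onto a Two-dimensional Subspace)

For a subspace $U = \operatorname{span}\left[ \begin{bmatrix} 1 \\ 1 \\ 1 \end{bmatrix}, \begin{bmatrix} 0 \\ 1 \\ 2 \end{bmatrix} \right] \subseteq \mathbb{R}^3$ and $\boldsymbol{x} = \begin{bmatrix} 6 \\ 0 \\ 0 \end{bmatrix} \in \mathbb{R}^3$ find the
coordinates $\boldsymbol{\lambda}$ of $\boldsymbol{x}$ in terms of the subspace $U$, the projection point $\pi_U(\boldsymbol{x})$
and the projection matrix $\boldsymbol{P}_\pi$.

First, we see that the generating set of $U$ is a basis (linear
independence) and write the basis vectors of $U$ into a matrix
$\boldsymbol{B} = \begin{bmatrix} 1 & 0 \\ 1 & 1 \\ 1 & 2 \end{bmatrix}$.

Second, we compute the matrix $\boldsymbol{B}^\top \boldsymbol{B}$ and the vector $\boldsymbol{B}^\top \boldsymbol{x}$ as

$$\boldsymbol{B}^\top \boldsymbol{B} = \begin{bmatrix} 1 & 1 & 1 \\ 0 & 1 & 2 \end{bmatrix} \begin{bmatrix} 1 & 0 \\ 1 & 1 \\ 1 & 2 \end{bmatrix} = \begin{bmatrix} 3 & 3 \\ 3 & 5 \end{bmatrix}, \quad \boldsymbol{B}^\top \boldsymbol{x} = \begin{bmatrix} 1 & 1 & 1 \\ 0 & 1 & 2 \end{bmatrix} \begin{bmatrix} 6 \\ 0 \\ 0 \end{bmatrix} = \begin{bmatrix} 6 \\ 0 \end{bmatrix}. \tag{3.60}$$

Third, we solve the normal equation $\boldsymbol{B}^\top \boldsymbol{B} \boldsymbol{\lambda} = \boldsymbol{B}^\top \boldsymbol{x}$ to find $\boldsymbol{\lambda}$:

$$\begin{bmatrix} 3 & 3 \\ 3 & 5 \end{bmatrix} \begin{bmatrix} \lambda_1 \\ \lambda_2 \end{bmatrix} = \begin{bmatrix} 6 \\ 0 \end{bmatrix} \iff \boldsymbol{\lambda} = \begin{bmatrix} 5 \\ -3 \end{bmatrix}. \tag{3.61}$$

Fourth, the projection $\pi_U(\boldsymbol{x})$ of $\boldsymbol{x}$ onto $U$, i.e., into the column space of
$\boldsymbol{B}$, can be directly computed via

$$\pi_U(\boldsymbol{x}) = \boldsymbol{B}\boldsymbol{\lambda} = \begin{bmatrix} 5 \\ 2 \\ -1 \end{bmatrix}. \tag{3.62}$$

The corresponding projection error is the norm of the difference vector
between the original vector and its projection onto $U$, i.e.,

$$\|\boldsymbol{x} - \pi_U(\boldsymbol{x})\| = \left\| \begin{bmatrix} 1 & -2 & 1 \end{bmatrix}^\top \right\| = \sqrt{6}. \tag{3.63}$$

Fifth, the projection matrix (for any $\boldsymbol{x} \in \mathbb{R}^3$) is given by

$$\boldsymbol{P}_\pi = \boldsymbol{B} (\boldsymbol{B}^\top \boldsymbol{B})^{-1} \boldsymbol{B}^\top = \frac{1}{6} \begin{bmatrix} 5 & 2 & -1 \\ 2 & 2 & 2 \\ -1 & 2 & 5 \end{bmatrix}. \tag{3.64}$$

To verify the results, we can (a) check whether the displacement vector
$\pi_U(\boldsymbol{x}) - \boldsymbol{x}$ is orthogonal to all basis vectors of $U$, and (b) verify that
$\boldsymbol{P}_\pi = \boldsymbol{P}_\pi^2$ (see Definition 3.10).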
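The three-step procedure for general subspaces and the two verification checks (a) and (b) can be sketched in NumPy as follows. The function name `project_onto_subspace` is our own, the dot product is assumed as the inner product, and the columns of `B` are assumed to be linearly independent. The normal equation is solved with `np.linalg.solve` instead of forming the inverse explicitly, which is a common numerical choice rather than something the text prescribes.

```python
import numpy as np

def project_onto_subspace(B, x):
    """Project x onto the column space of B (columns assumed linearly independent).

    Returns the coordinates lam, the projection B @ lam, and the
    projection matrix B (B^T B)^{-1} B^T.
    """
    BtB = B.T @ B
    lam = np.linalg.solve(BtB, B.T @ x)     # normal equation (3.56)
    proj = B @ lam                          # projection (3.58)
    P = B @ np.linalg.solve(BtB, B.T)       # projection matrix (3.59)
    return lam, proj, P

B = np.array([[1.0, 0.0],
              [1.0, 1.0],
              [1.0, 2.0]])
x = np.array([6.0, 0.0, 0.0])

lam, proj, P = project_onto_subspace(B, x)
error = np.linalg.norm(x - proj)            # projection error (3.63)

print(lam, proj, error)
print(np.allclose(B.T @ (proj - x), 0.0))   # check (a): displacement is orthogonal to U
print(np.allclose(P @ P, P))                # check (b): P is idempotent
```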


Remark. The projections $\pi_U(\boldsymbol{x})$ are still vectors in $\mathbb{R}^n$ although they lie in
an $m$-dimensional subspace $U \subseteq \mathbb{R}^n$. However, to represent a projected
vector we only need the $m$ coordinates $\lambda_1, \ldots, \lambda_m$ with respect to the
basis vectors $\boldsymbol{b}_1, \ldots, \boldsymbol{b}_m$ of $U$. ♢


Remark. In vector spaces with general inner products, we have to pay
attention when computing angles and distances, which are defined by
means of the inner product. ♢


Projections allow us to look at situations where we have a linear system 
Ax = b without a solution. Recall that this means that b does not lie in 
the span of A, i.e., the vector b does not lie in the subspace spanned by 
the columns of A. Given that the linear equation cannot be solved exactly, 
we can find an approximate solution. The idea is to find the vector in the 
subspace spanned by the columns of A that is closest to b, i.e., we compute 
the orthogonal projection of b onto the subspace spanned by the columns 
of A. This problem arises often in practice, and the solution is called the 
least-squares solution (assuming the dot product as the inner product) of 
an overdetermined system. This is discussed further in Section 9.4. Using 
reconstruction errors (3.63) is one possible approach to derive principal 
component analysis (Section 10.3). 
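The connection between projections and least squares can be illustrated numerically. The sketch below reuses the matrix and vector of Example 3.11 as a small overdetermined system $\boldsymbol{A}\boldsymbol{x} = \boldsymbol{b}$ (this reuse is our choice for illustration) and compares NumPy's least-squares solver with the solution of the normal equation.

```python
import numpy as np

# Overdetermined system: three equations, two unknowns, no exact solution.
A = np.array([[1.0, 0.0],
              [1.0, 1.0],
              [1.0, 2.0]])
b = np.array([6.0, 0.0, 0.0])

# Least-squares solution from NumPy's solver.
x_ls, residuals, rank, singular_values = np.linalg.lstsq(A, b, rcond=None)

# Same solution from the normal equation A^T A x = A^T b.
x_ne = np.linalg.solve(A.T @ A, A.T @ b)

# A @ x_ls is the orthogonal projection of b onto the column space of A,
# so the residual is orthogonal to every column of A.
residual = b - A @ x_ls

print(x_ls, x_ne)
print(A.T @ residual)   # numerically zero
```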


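Remark. We just looked at projections of vectors $\boldsymbol{x}$ onto a subspace $U$ with
basis vectors $\{\boldsymbol{b}_1, \ldots, \boldsymbol{b}_k\}$. If this basis is an ONB, i.e., (3.33) and (3.34)
are satisfied, the projection equation (3.58) simplifies greatly to

$$\pi_U(\boldsymbol{x}) = \boldsymbol{B}\boldsymbol{B}^\top \boldsymbol{x} \tag{3.65}$$

since $\boldsymbol{B}^\top \boldsymbol{B} = \boldsymbol{I}$ with coordinates

$$\boldsymbol{\lambda} = \boldsymbol{B}^\top \boldsymbol{x}. \tag{3.66}$$

This means that we no longer have to compute the inverse from (3.58),
which saves computation time. ♢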
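3.8.3 Gram-Schmidt Orthogonalization


Projections are at the core of the Gram-Schmidt method that allows us to
constructively transform any basis $(\boldsymbol{b}_1, \ldots, \boldsymbol{b}_n)$ of an $n$-dimensional vector
space $V$ into an orthogonal/orthonormal basis $(\boldsymbol{u}_1, \ldots, \boldsymbol{u}_n)$ of $V$. This
basis always exists (Liesen and Mehrmann, 2015) and $\operatorname{span}[\boldsymbol{b}_1, \ldots, \boldsymbol{b}_n] =
\operatorname{span}[\boldsymbol{u}_1, \ldots, \boldsymbol{u}_n]$. The Gram-Schmidt orthogonalization method iteratively
constructs an orthogonal basis $(\boldsymbol{u}_1, \ldots, \boldsymbol{u}_n)$ from any basis $(\boldsymbol{b}_1, \ldots, \boldsymbol{b}_n)$ of
$V$ as follows:

$$\boldsymbol{u}_1 := \boldsymbol{b}_1 \tag{3.67}$$
$$\boldsymbol{u}_k := \boldsymbol{b}_k - \pi_{\operatorname{span}[\boldsymbol{u}_1, \ldots, \boldsymbol{u}_{k-1}]}(\boldsymbol{b}_k), \quad k = 2, \ldots, n. \tag{3.68}$$

In (3.68), the $k$th basis vector $\boldsymbol{b}_k$ is projected onto the subspace spanned
by the first $k - 1$ constructed orthogonal vectors $\boldsymbol{u}_1, \ldots, \boldsymbol{u}_{k-1}$; see
Section 3.8.2. This projection is then subtracted from $\boldsymbol{b}_k$ and yields a vector
$\boldsymbol{u}_k$ that is orthogonal to the $(k - 1)$-dimensional subspace spanned by
$\boldsymbol{u}_1, \ldots, \boldsymbol{u}_{k-1}$. Repeating this procedure for all $n$ basis vectors $\boldsymbol{b}_1, \ldots, \boldsymbol{b}_n$
yields an orthogonal basis $(\boldsymbol{u}_1, \ldots, \boldsymbol{u}_n)$ of $V$. If we normalize the $\boldsymbol{u}_k$, we
obtain an ONB where $\|\boldsymbol{u}_k\| = 1$ for $k = 1, \ldots, n$.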
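The iteration (3.67) and (3.68) can be sketched as follows, assuming the dot product as the inner product. Because the previously constructed vectors $\boldsymbol{u}_1, \ldots, \boldsymbol{u}_{k-1}$ are mutually orthogonal, the projection onto their span is the sum of the one-dimensional projections from (3.42). The function name `gram_schmidt`, the `normalize` flag, and the example basis are our own illustrative choices.

```python
import numpy as np

def gram_schmidt(B, normalize=False):
    """Orthogonalize the columns b_1, ..., b_n of B (assumed linearly independent).

    Returns a matrix whose columns u_1, ..., u_n are mutually orthogonal
    (and of unit length if normalize=True).
    """
    B = np.asarray(B, dtype=float)
    U = np.zeros_like(B)
    for k in range(B.shape[1]):
        u = B[:, k].copy()
        for j in range(k):
            # Subtract the projection of b_k onto the line spanned by u_j.
            u -= (U[:, j] @ B[:, k]) / (U[:, j] @ U[:, j]) * U[:, j]
        U[:, k] = u
    if normalize:
        U = U / np.linalg.norm(U, axis=0)
    return U

# Hypothetical basis of R^3; the columns are the basis vectors.
B = np.array([[1.0, 1.0, 0.0],
              [1.0, 0.0, 1.0],
              [0.0, 1.0, 1.0]])

U = gram_schmidt(B)                    # orthogonal basis
Q = gram_schmidt(B, normalize=True)    # orthonormal basis

print(U.T @ U)   # diagonal matrix: off-diagonal dot products vanish
print(Q.T @ Q)   # identity matrix
```

This is the classical form of the method, written to mirror (3.68); for ill-conditioned bases, numerically more robust variants or a QR decomposition are usually preferred.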


Example 3.12 (Gram-Schmidt Orthogonalization)

Consider a basis $(\boldsymbol{b}_1, \boldsymbol{b}_2)$ of $\mathbb{R}^2$, where

$$\boldsymbol{b}_1 = \begin{bmatrix} 2 \\ 0 \end{bmatrix}, \quad \boldsymbol{b}_2 = \begin{bmatrix} 1 \\ 1 \end{bmatrix}; \tag{3.69}$$

see also Figure 3.12(a). Using the Gram-Schmidt method, we construct an
orthogonal basis $(\boldsymbol{u}_1, \boldsymbol{u}_2)$ of $\mathbb{R}^2$ as follows (assuming the dot product as
the inner product):

$$\boldsymbol{u}_1 := \boldsymbol{b}_1 = \begin{bmatrix} 2 \\ 0 \end{bmatrix}, \tag{3.70}$$
$$\boldsymbol{u}_2 := \boldsymbol{b}_2 - \pi_{\operatorname{span}[\boldsymbol{u}_1]}(\boldsymbol{b}_2) \overset{(3.45)}{=} \boldsymbol{b}_2 - \frac{\boldsymbol{u}_1 \boldsymbol{u}_1^\top}{\|\boldsymbol{u}_1\|^2} \boldsymbol{b}_2 = \begin{bmatrix} 1 \\ 1 \end{bmatrix} - \begin{bmatrix} 1 & 0 \\ 0 & 0 \end{bmatrix} \begin{bmatrix} 1 \\ 1 \end{bmatrix} = \begin{bmatrix} 0 \\ 1 \end{bmatrix}. \tag{3.71}$$

These steps are illustrated in Figures 3.12(b) and (c). We immediately see
that $\boldsymbol{u}_1$ and $\boldsymbol{u}_2$ are orthogonal, i.e., $\boldsymbol{u}_1^\top \boldsymbol{u}_2 = 0$.

Figure 3.12 Gram-Schmidt orthogonalization. (a) Non-orthogonal basis
$(\boldsymbol{b}_1, \boldsymbol{b}_2)$ of $\mathbb{R}^2$; (b) first constructed basis vector $\boldsymbol{u}_1$ and orthogonal
projection of $\boldsymbol{b}_2$ onto $\operatorname{span}[\boldsymbol{u}_1]$; (c) orthogonal basis $(\boldsymbol{u}_1, \boldsymbol{u}_2)$ of $\mathbb{R}^2$.

Figure 3.13 Projection onto an affine space. (a) Original setting;
(b) setting shifted by $-\boldsymbol{x}_0$ so that $\boldsymbol{x} - \boldsymbol{x}_0$ can be projected onto the
direction space $U$; (c) projection is translated back to $\boldsymbol{x}_0 + \pi_U(\boldsymbol{x} - \boldsymbol{x}_0)$,
which gives the final orthogonal projection $\pi_L(\boldsymbol{x})$.


3.8.4 Projection onto Affine Subspaces 


Thus far, we discussed how to project a vector onto a lower-dimensional
subspace $U$. In the following, we provide a solution to projecting a vector
onto an affine subspace.

Consider the setting in Figure 3.13(a). We are given an affine space $L =
\boldsymbol{x}_0 + U$, where $\boldsymbol{b}_1, \boldsymbol{b}_2$ are basis vectors of $U$. To determine the orthogonal
projection $\pi_L(\boldsymbol{x})$ of $\boldsymbol{x}$ onto $L$, we transform the problem into a problem
that we know how to solve: the projection onto a vector subspace. In
order to get there, we subtract the support point $\boldsymbol{x}_0$ from $\boldsymbol{x}$ and from $L$,
so that $L - \boldsymbol{x}_0 = U$ is exactly the vector subspace $U$. We can now use the
orthogonal projections onto a subspace we discussed in Section 3.8.2 and
obtain the projection $\pi_U(\boldsymbol{x} - \boldsymbol{x}_0)$, which is illustrated in Figure 3.13(b).
This projection can now be translated back into $L$ by adding $\boldsymbol{x}_0$, such that
we obtain the orthogonal projection onto an affine space $L$ as

$$\pi_L(\boldsymbol{x}) = \boldsymbol{x}_0 + \pi_U(\boldsymbol{x} - \boldsymbol{x}_0), \tag{3.72}$$

where $\pi_U(\cdot)$ is the orthogonal projection onto the subspace $U$, i.e., the
direction space of $L$; see Figure 3.13(c).

From Figure 3.13, it is also evident that the distance of $\boldsymbol{x}$ from the affine
space $L$ is identical to the distance of $\boldsymbol{x} - \boldsymbol{x}_0$ from $U$, i.e.,

$$d(\boldsymbol{x}, L) = \|\boldsymbol{x} - \pi_L(\boldsymbol{x})\| = \|\boldsymbol{x} - (\boldsymbol{x}_0 + \pi_U(\boldsymbol{x} - \boldsymbol{x}_0))\| \tag{3.73a}$$
$$= d(\boldsymbol{x} - \boldsymbol{x}_0, \pi_U(\boldsymbol{x} - \boldsymbol{x}_0)) = d(\boldsymbol{x} - \boldsymbol{x}_0, U). \tag{3.73b}$$


We will use projections onto an affine subspace to derive the concept of 
a separating hyperplane in Section 12.1. 
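The shift, project, shift-back recipe of (3.72) translates directly into code. The sketch below assumes the dot product as the inner product and linearly independent columns in `B`; the function name `project_onto_affine` and the example line in $\mathbb{R}^2$ are hypothetical choices for illustration.

```python
import numpy as np

def project_onto_affine(B, x0, x):
    """Orthogonal projection of x onto L = x0 + span(columns of B), see (3.72)."""
    shifted = x - x0                                   # move the problem to the subspace U
    lam = np.linalg.solve(B.T @ B, B.T @ shifted)      # coordinates via the normal equation
    return x0 + B @ lam                                # translate the projection back into L

B = np.array([[1.0],
              [1.0]])             # direction space U spanned by [1, 1]
x0 = np.array([0.0, 1.0])         # support point of the affine line
x = np.array([2.0, 0.0])

proj = project_onto_affine(B, x0, x)
distance = np.linalg.norm(x - proj)    # d(x, L), see (3.73a)

print(proj, distance)
```

For these values the displacement `x - proj` is orthogonal to the direction `[1, 1]`, as required for an orthogonal projection.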
Figure 3.14 panels: Original; Rotated by $112.5^\circ$.
3.9 Rotations 


Length and angle preservation, as discussed in Section 3.4, are the two
characteristics of linear mappings with orthogonal transformation
matrices. In the following, we will have a closer look at specific orthogonal
transformation matrices, which describe rotations.

A rotation is a linear mapping (more specifically, an automorphism of
a Euclidean vector space) that rotates a plane by an angle $\theta$ about the
origin, i.e., the origin is a fixed point. For a positive angle $\theta > 0$, by
common convention, we rotate in a counterclockwise direction. An example is
shown in Figure 3.14, where the transformation matrix is

$$\boldsymbol{R} = \begin{bmatrix} -0.38 & -0.92 \\ 0.92 & -0.38 \end{bmatrix}. \tag{3.74}$$

Important application areas of rotations include computer graphics and
robotics. For example, in robotics, it is often important to know how to
rotate the joints of a robotic arm in order to pick up or place an object,
see Figure 3.15.
Figure 3.14 A rotation rotates objects in a plane about the origin. If the rotation angle is positive, we rotate counterclockwise.

Figure 3.15 The robotic arm needs to rotate its joints in order to pick up objects or to place them correctly. Figure taken from (Deisenroth et al., 2015).

Figure 3.16 Rotation of the standard basis in $\mathbb{R}^2$ by an angle $\theta$.


3.9.1 Rotations in $\mathbb{R}^2$


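Consider the standard basis $\left\{ e_1 = \begin{bmatrix} 1 \\ 0 \end{bmatrix},\ e_2 = \begin{bmatrix} 0 \\ 1 \end{bmatrix} \right\}$ of $\mathbb{R}^2$, which defines
the standard coordinate system in $\mathbb{R}^2$. We aim to rotate this coordinate
system by an angle $\theta$ as illustrated in Figure 3.16. Note that the rotated
vectors are still linearly independent and, therefore, are a basis of $\mathbb{R}^2$. This
means that the rotation performs a basis change.

Rotations $\Phi$ are linear mappings so that we can express them by a
rotation matrix $R(\theta)$. Trigonometry (see Figure 3.16) allows us to
determine the coordinates of the rotated axes (the image of $\Phi$) with respect
to the standard basis in $\mathbb{R}^2$. We obtain


\[ \Phi(e_1) = \begin{bmatrix} \cos\theta \\ \sin\theta \end{bmatrix}, \quad \Phi(e_2) = \begin{bmatrix} -\sin\theta \\ \cos\theta \end{bmatrix}. \tag{3.75} \]


Therefore, the rotation matrix that performs the basis change into the
rotated coordinates $R(\theta)$ is given as


\[ R(\theta) = \begin{bmatrix} \Phi(e_1) & \Phi(e_2) \end{bmatrix} = \begin{bmatrix} \cos\theta & -\sin\theta \\ \sin\theta & \cos\theta \end{bmatrix}. \tag{3.76} \]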
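
As a quick illustration of (3.76), the sketch below builds the 2D rotation matrix with NumPy and evaluates it at 112.5°, the angle used in Figure 3.14. The function name `rotation_matrix_2d` is an illustrative choice, not something defined in the book.

```python
import numpy as np

def rotation_matrix_2d(theta):
    """Rotation matrix R(theta) from (3.76); theta in radians."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s],
                     [s,  c]])

R = rotation_matrix_2d(np.deg2rad(112.5))

# The columns of R are the images of the standard basis vectors, cf. (3.75).
e1 = np.array([1.0, 0.0])
e2 = np.array([0.0, 1.0])
image_e1 = R @ e1
image_e2 = R @ e2

# Rounded to two decimals, R agrees with the entries shown in (3.74).
R_rounded = np.round(R, 2)
```

Because the columns are orthonormal, `R.T @ R` is the identity up to floating-point error, which is the orthogonality property discussed in Section 3.4.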


3.9.2 Rotations in $\mathbb{R}^3$


In contrast to the $\mathbb{R}^2$ case, in $\mathbb{R}^3$ we can rotate any two-dimensional plane
about a one-dimensional axis. The easiest way to specify the general rotation
matrix is to specify how the images of the standard basis $e_1, e_2, e_3$ are
supposed to be rotated, making sure these images $Re_1, Re_2, Re_3$ are
orthonormal to each other. We can then obtain a general rotation matrix
$R$ by combining the images of the standard basis.

To have a meaningful rotation angle, we have to define what “counterclockwise”
means when we operate in more than two dimensions. We
use the convention that a “counterclockwise” (planar) rotation about an
axis refers to a rotation about an axis when we look at the axis “head on,
from the end toward the origin”. In $\mathbb{R}^3$, there are therefore three (planar)
rotations about the three standard basis vectors (see Figure 3.17):
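= Rotation about the $e_1$-axis


\[ R_1(\theta) = \begin{bmatrix} \Phi(e_1) & \Phi(e_2) & \Phi(e_3) \end{bmatrix} = \begin{bmatrix} 1 & 0 & 0 \\ 0 & \cos\theta & -\sin\theta \\ 0 & \sin\theta & \cos\theta \end{bmatrix}. \tag{3.77} \]


Here, the $e_1$ coordinate is fixed, and the counterclockwise rotation is
performed in the $e_2 e_3$ plane.


= Rotation about the $e_2$-axis


\[ R_2(\theta) = \begin{bmatrix} \cos\theta & 0 & \sin\theta \\ 0 & 1 & 0 \\ -\sin\theta & 0 & \cos\theta \end{bmatrix}. \tag{3.78} \]


If we rotate the $e_1 e_3$ plane about the $e_2$ axis, we need to look at the $e_2$
axis from its “tip” toward the origin.


= Rotation about the $e_3$-axis


\[ R_3(\theta) = \begin{bmatrix} \cos\theta & -\sin\theta & 0 \\ \sin\theta & \cos\theta & 0 \\ 0 & 0 & 1 \end{bmatrix}. \tag{3.79} \]


Figure 3.17 illustrates this.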
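
The three axis rotations (3.77) to (3.79) translate directly into code. The sketch below assumes NumPy; the function names `rot_e1`, `rot_e2`, `rot_e3` are illustrative. A quarter turn about each axis maps one of the remaining basis vectors onto the other, which gives an easy sanity check of the sign conventions.

```python
import numpy as np

def rot_e1(theta):
    """Rotation about the e1-axis, cf. (3.77)."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[1, 0, 0], [0, c, -s], [0, s, c]], dtype=float)

def rot_e2(theta):
    """Rotation about the e2-axis, cf. (3.78)."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, 0, s], [0, 1, 0], [-s, 0, c]], dtype=float)

def rot_e3(theta):
    """Rotation about the e3-axis, cf. (3.79)."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s, 0], [s, c, 0], [0, 0, 1]], dtype=float)

e1, e2, e3 = np.eye(3)          # rows of the identity are the standard basis
quarter = np.pi / 2

e1_about_e3 = rot_e3(quarter) @ e1   # counterclockwise in the e1-e2 plane
e2_about_e1 = rot_e1(quarter) @ e2   # counterclockwise in the e2-e3 plane
e3_about_e2 = rot_e2(quarter) @ e3   # counterclockwise when viewed from the tip of e2
```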


3.9.3 Rotations in n Dimensions 


The generalization of rotations from 2D and 3D to $n$-dimensional
Euclidean vector spaces can be intuitively described as fixing $n - 2$
dimensions and restricting the rotation to a two-dimensional plane in the
$n$-dimensional space. As in the three-dimensional case, we can rotate any plane
(two-dimensional subspace of $\mathbb{R}^n$).


Figure 3.17 Rotation of a vector (gray) in $\mathbb{R}^3$ by an angle $\theta$ about the $e_3$-axis. The rotated vector is shown in blue.


Definition 3.11 (Givens Rotation). Let $V$ be an $n$-dimensional Euclidean
vector space and $\Phi : V \to V$ an automorphism with transformation matrix


\[ R_{ij}(\theta) := \begin{bmatrix} I_{i-1} & 0 & \cdots & \cdots & 0 \\ 0 & \cos\theta & 0 & -\sin\theta & 0 \\ 0 & 0 & I_{j-i-1} & 0 & 0 \\ 0 & \sin\theta & 0 & \cos\theta & 0 \\ 0 & \cdots & \cdots & 0 & I_{n-j} \end{bmatrix} \in \mathbb{R}^{n \times n}, \tag{3.80} \]


for $1 \leq i < j \leq n$ and $\theta \in \mathbb{R}$. Then $R_{ij}(\theta)$ is called a Givens rotation.
Essentially, $R_{ij}(\theta)$ is the identity matrix $I_n$ with


\[ r_{ii} = \cos\theta, \quad r_{ij} = -\sin\theta, \quad r_{ji} = \sin\theta, \quad r_{jj} = \cos\theta. \tag{3.81} \]


In two dimensions (i.e., $n = 2$), we obtain (3.76) as a special case.
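
Following (3.81), a Givens rotation can be built by starting from the identity matrix and overwriting four entries. The sketch below assumes NumPy and uses 1-based indices `i`, `j` to match the definition; the function name `givens_rotation` is an illustrative choice.

```python
import numpy as np

def givens_rotation(n, i, j, theta):
    """Givens rotation R_ij(theta) in R^{n x n}, with 1 <= i < j <= n as in (3.80)."""
    if not (1 <= i < j <= n):
        raise ValueError("indices must satisfy 1 <= i < j <= n")
    R = np.eye(n)
    c, s = np.cos(theta), np.sin(theta)
    # Overwrite the four entries listed in (3.81); shift to 0-based indexing.
    R[i - 1, i - 1] = c
    R[i - 1, j - 1] = -s
    R[j - 1, i - 1] = s
    R[j - 1, j - 1] = c
    return R

G = givens_rotation(4, 2, 4, 0.3)       # rotates the plane spanned by e2 and e4
G_2d = givens_rotation(2, 1, 2, 0.3)    # special case n = 2, cf. (3.76)
```

Coordinates outside the rotated plane are left untouched, so in this example the first and third components of any vector are unchanged by `G`.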


3.9.4 Properties of Rotations 


Rotations exhibit a number of useful properties, which can be derived by 
considering them as orthogonal matrices (Definition 3.8): 


= Rotations preserve distances, i.e., $\|x - y\| = \|R_\theta(x) - R_\theta(y)\|$. In other
words, rotations leave the distance between any two points unchanged
after the transformation.

= Rotations preserve angles, i.e., the angle between $R_\theta x$ and $R_\theta y$ equals
the angle between $x$ and $y$.

= Rotations in three (or more) dimensions are generally not commutative.
Therefore, the order in which rotations are applied is important,
even if they rotate about the same point. Only in two dimensions are vector
rotations commutative, such that $R(\phi)R(\theta) = R(\theta)R(\phi)$ for all
$\phi, \theta \in [0, 2\pi)$. They form an Abelian group (with multiplication) only if
they rotate about the same point (e.g., the origin).
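
These properties are easy to probe numerically. The sketch below assumes NumPy; the vectors and angles are arbitrary illustrative values, and the helper names are not from the book. It checks distance and angle preservation for a 2D rotation and compares the two orders of composition in 2D and in 3D.

```python
import numpy as np

def rot2(theta):
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s], [s, c]])

def rot_e1(theta):
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[1, 0, 0], [0, c, -s], [0, s, c]], dtype=float)

def rot_e3(theta):
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s, 0], [s, c, 0], [0, 0, 1]], dtype=float)

def angle(a, b):
    cos_ab = a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
    return np.arccos(np.clip(cos_ab, -1.0, 1.0))

x = np.array([1.0, 2.0])
y = np.array([-3.0, 0.5])
R = rot2(0.7)

dist_before, dist_after = np.linalg.norm(x - y), np.linalg.norm(R @ x - R @ y)
angle_before, angle_after = angle(x, y), angle(R @ x, R @ y)

# Order of composition: irrelevant in 2D, relevant in 3D.
commute_2d = np.allclose(rot2(0.4) @ rot2(1.1), rot2(1.1) @ rot2(0.4))
commute_3d = np.allclose(rot_e1(0.4) @ rot_e3(1.1), rot_e3(1.1) @ rot_e1(0.4))
```

For these values `commute_2d` is true while `commute_3d` is false, matching the last bullet above.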


3.10 Further Reading 


In this chapter, we gave a brief overview of some of the important concepts 
of analytic geometry, which we will use in later chapters of the book. 
For a broader and more in-depth overview of some of the concepts we 
presented, we refer to the following excellent books: Axler (2015) and 
Boyd and Vandenberghe (2018). 

Inner products allow us to determine specific bases of vector (sub)spaces, 
where each vector is orthogonal to all others (orthogonal bases) using the 
Gram-Schmidt method. These bases are important in optimization and 
numerical algorithms for solving linear equation systems. For instance, 
Krylov subspace methods, such as conjugate gradients or the generalized 
minimal residual method (GMRES), minimize residual errors that are
orthogonal to each other (Stoer and Bulirsch, 2002).

In machine learning, inner products are important in the context of
kernel methods (Schölkopf and Smola, 2002). Kernel methods exploit the
fact that many linear algorithms can be expressed purely by inner product
computations. Then, the “kernel trick” allows us to compute these
inner products implicitly in a (potentially infinite-dimensional) feature
space, without even knowing this feature space explicitly. This allowed the
“non-linearization” of many algorithms used in machine learning, such as
kernel-PCA (Schölkopf et al., 1997) for dimensionality reduction. Gaussian
processes (Rasmussen and Williams, 2006) also fall into the category
of kernel methods and are the current state of the art in probabilistic
regression (fitting curves to data points). The idea of kernels is explored
further in Chapter 12.

Projections are often used in computer graphics, e.g., to generate shad- 
ows. In optimization, orthogonal projections are often used to (iteratively) 
minimize residual errors. This also has applications in machine learning, 
e.g., in linear regression where we want to find a (linear) function that 
minimizes the residual errors, i.e., the lengths of the orthogonal projec- 
tions of the data onto the linear function (Bishop, 2006). We will investi- 
gate this further in Chapter 9. PCA (Pearson, 1901; Hotelling, 1933) also 
uses projections to reduce the dimensionality of high-dimensional data. 
We will discuss this in more detail in Chapter 10. 


©2021 M. P. Deisenroth, A. A. Faisal, C. S. Ong. Published by Cambridge University Press (2020). 




Exercises 
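3.1 Show that $\langle \cdot, \cdot \rangle$ defined for all $x = [x_1, x_2]^\top \in \mathbb{R}^2$ and $y = [y_1, y_2]^\top \in \mathbb{R}^2$ by


\[ \langle x, y \rangle := x_1 y_1 - (x_1 y_2 + x_2 y_1) + 2(x_2 y_2) \]


is an inner product.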
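3.2 Consider $\mathbb{R}^2$ with $\langle \cdot, \cdot \rangle$ defined for all $x$ and $y$ in $\mathbb{R}^2$ as


\[ \langle x, y \rangle := x^\top \underbrace{\begin{bmatrix} 2 & 0 \\ 1 & 2 \end{bmatrix}}_{=: A} y. \]


Is $\langle \cdot, \cdot \rangle$ an inner product?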
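3.3 Compute the distance between

\[ x = \begin{bmatrix} 1 \\ 2 \\ 3 \end{bmatrix}, \quad y = \begin{bmatrix} -1 \\ -1 \\ 0 \end{bmatrix} \]

using
a. $\langle x, y \rangle := x^\top y$
b. $\langle x, y \rangle := x^\top A y, \quad A := \begin{bmatrix} 2 & 1 & 0 \\ 1 & 3 & -1 \\ 0 & -1 & 2 \end{bmatrix}$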


3.4 Compute the angle between 


using 


a. (@,y):= a2 y 
b. (@,y):=a' By, B:= F | 





1 3 


3.5 Consider the Euclidean vector space $\mathbb{R}^5$ with the dot product. A subspace
$U \subseteq \mathbb{R}^5$ and $x \in \mathbb{R}^5$ are given by


E ge) te E 

—1 —3 4 —3 —9 
Qe Les 
0 —1 
2 2 


U = span[ 


= Ne 
NO oO 
A 


a. Determine the orthogonal projection $\pi_U(x)$ of $x$ onto $U$
b. Determine the distance $d(x, U)$


3.6 Consider $\mathbb{R}^3$ with the inner product


\[ \langle x, y \rangle := x^\top \begin{bmatrix} 2 & 1 & 0 \\ 1 & 2 & -1 \\ 0 & -1 & 2 \end{bmatrix} y. \]


Furthermore, we define $e_1, e_2, e_3$ as the standard/canonical basis in $\mathbb{R}^3$.
a. Determine the orthogonal projection $\pi_U(e_2)$ of $e_2$ onto
\[ U = \operatorname{span}[e_1, e_3]. \]


Hint: Orthogonality is defined through the inner product.
b. Compute the distance $d(e_2, U)$.
c. Draw the scenario: standard basis vectors and $\pi_U(e_2)$


3.7 Let $V$ be a vector space and $\pi$ an endomorphism of $V$.
a. Prove that $\pi$ is a projection if and only if $\mathrm{id}_V - \pi$ is a projection, where
$\mathrm{id}_V$ is the identity endomorphism on $V$.


b. Assume now that $\pi$ is a projection. Calculate $\operatorname{Im}(\mathrm{id}_V - \pi)$ and $\ker(\mathrm{id}_V - \pi)$
as a function of $\operatorname{Im}(\pi)$ and $\ker(\pi)$.


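3.8 Using the Gram-Schmidt method, turn the basis $B = (b_1, b_2)$ of a
two-dimensional subspace $U \subseteq \mathbb{R}^3$ into an ONB $C = (c_1, c_2)$ of $U$, where


\[ b_1 := \begin{bmatrix} 1 \\ 1 \\ 1 \end{bmatrix}, \quad b_2 := \begin{bmatrix} -1 \\ 2 \\ 0 \end{bmatrix}. \]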
3.9 Let $n \in \mathbb{N}$ and let $x_1, \ldots, x_n > 0$ be $n$ positive real numbers so that
$x_1 + \ldots + x_n = 1$. Use the Cauchy-Schwarz inequality and show that


a. $\sum_{i=1}^{n} x_i^2 \geq \frac{1}{n}$
b. $\sum_{i=1}^{n} \frac{1}{x_i} \geq n^2$
Hint: Think about the dot product on $\mathbb{R}^n$. Then, choose specific vectors
$x, y \in \mathbb{R}^n$ and apply the Cauchy-Schwarz inequality.
3.10 Rotate the vectors 


by 30°. 
4 Matrix Decompositions


In Chapters 2 and 3, we studied ways to manipulate and measure vectors, 
projections of vectors, and linear mappings. Mappings and transforma- 
tions of vectors can be conveniently described as operations performed by 
matrices. Moreover, data is often represented in matrix form as well, e.g., 
where the rows of the matrix represent different people and the columns 
describe different features of the people, such as weight, height, and socio- 
economic status. In this chapter, we present three aspects of matrices: how 
to summarize matrices, how matrices can be decomposed, and how these 
decompositions can be used for matrix approximations. 

We first consider methods that allow us to describe matrices with just 
a few numbers that characterize the overall properties of matrices. We 
will do this in the sections on determinants (Section 4.1) and eigenval- 
ues (Section 4.2) for the important special case of square matrices. These 
characteristic numbers have important mathematical consequences and 
allow us to quickly grasp what useful properties a matrix has. From here 
we will proceed to matrix decomposition methods: An analogy for ma- 
trix decomposition is the factoring of numbers, such as the factoring of 
21 into prime numbers 7 · 3. For this reason matrix decomposition is also
often referred to as matrix factorization. Matrix decompositions are used 
to describe a matrix by means of a different representation using factors 
of interpretable matrices. 

We will first cover a square-root-like operation for symmetric, positive 
definite matrices, the Cholesky decomposition (Section 4.3). From here 
we will look at two related methods for factorizing matrices into canoni- 
cal forms. The first one is known as matrix diagonalization (Section 4.4), 
which allows us to represent the linear mapping using a diagonal trans- 
formation matrix if we choose an appropriate basis. The second method, 
singular value decomposition (Section 4.5), extends this factorization to 
non-square matrices, and it is considered one of the fundamental concepts 
in linear algebra. These decompositions are helpful, as matrices represent- 
ing numerical data are often very large and hard to analyze. We conclude 
the chapter with a systematic overview of the types of matrices and the 
characteristic properties that distinguish them in the form of a matrix tax- 
onomy (Section 4.7). 

The methods that we cover in this chapter will become important in 
both subsequent mathematical chapters, such as Chapter 6, and
applied chapters, such as dimensionality reduction in Chapter 10 or
density estimation in Chapter 11. This chapter’s overall structure is depicted
in the mind map of Figure 4.1.


4.1 Determinant and Trace 


Determinants are important concepts in linear algebra. A determinant is
a mathematical object in the analysis and solution of systems of linear
equations. Determinants are only defined for square matrices $A \in \mathbb{R}^{n \times n}$,
i.e., matrices with the same number of rows and columns. In this book,
we write the determinant as $\det(A)$ or sometimes as $|A|$ so that


\[ \det(A) = \begin{vmatrix} a_{11} & a_{12} & \cdots & a_{1n} \\ a_{21} & a_{22} & \cdots & a_{2n} \\ \vdots & & \ddots & \vdots \\ a_{n1} & a_{n2} & \cdots & a_{nn} \end{vmatrix}. \tag{4.1} \]


The determinant notation $|A|$ must not be confused with the absolute value.

The determinant of a square matrix $A \in \mathbb{R}^{n \times n}$ is a function that maps $A$
onto a real number. Before providing a definition of the determinant for
general $n \times n$ matrices, let us have a look at some motivating examples,
and define determinants for some special matrices.

Figure 4.1 A mind map of the concepts introduced in this chapter, along with where they are used in other parts of the book.


Example 4.1 (Testing for Matrix Invertibility)
Let us begin with exploring if a square matrix $A$ is invertible (see
Section 2.2.2). For the smallest cases, we already know when a matrix
is invertible. If $A$ is a $1 \times 1$ matrix, i.e., it is a scalar number, then
$A = a \implies A^{-1} = \frac{1}{a}$. Thus $a \cdot \frac{1}{a} = 1$ holds if and only if $a \neq 0$.

For $2 \times 2$ matrices, by the definition of the inverse (Definition 2.3), we
know that $AA^{-1} = I$. Then, with (2.24), the inverse of $A$ is


\[ A^{-1} = \frac{1}{a_{11}a_{22} - a_{12}a_{21}} \begin{bmatrix} a_{22} & -a_{12} \\ -a_{21} & a_{11} \end{bmatrix}. \tag{4.2} \]

Hence, $A$ is invertible if and only if

\[ a_{11}a_{22} - a_{12}a_{21} \neq 0. \tag{4.3} \]

This quantity is the determinant of $A \in \mathbb{R}^{2 \times 2}$, i.e.,


\[ \det(A) = \begin{vmatrix} a_{11} & a_{12} \\ a_{21} & a_{22} \end{vmatrix} = a_{11}a_{22} - a_{12}a_{21}. \tag{4.4} \]

Example 4.1 points already at the relationship between determinants 
and the existence of inverse matrices. The next theorem states the same 
result for n x n matrices. 


Theorem 4.1. For any square matrix $A \in \mathbb{R}^{n \times n}$ it holds that $A$ is invertible
if and only if $\det(A) \neq 0$.
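
For the $2 \times 2$ case, the relationship between the determinant and the inverse can be sketched in a few lines. The code below assumes NumPy; `inverse_2x2` is an illustrative helper that applies the closed-form inverse (4.2) and refuses to invert when the determinant (4.4) vanishes. The tolerance check with `np.isclose` is an assumption made for floating-point input; it is not a robust numerical test for near-singular matrices.

```python
import numpy as np

def inverse_2x2(A):
    """Closed-form inverse of a 2x2 matrix, cf. (4.2)."""
    (a11, a12), (a21, a22) = A
    det = a11 * a22 - a12 * a21          # determinant, cf. (4.4)
    if np.isclose(det, 0.0):
        raise ValueError("matrix is not invertible: det(A) = 0")
    return np.array([[a22, -a12],
                     [-a21, a11]]) / det

A = np.array([[4.0, 2.0],
              [1.0, 3.0]])
A_inv = inverse_2x2(A)                   # det(A) = 10, so the inverse exists

singular = np.array([[1.0, 2.0],
                     [2.0, 4.0]])        # second row is twice the first
```

Calling `inverse_2x2(singular)` raises a `ValueError`, since its determinant is zero.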


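We have explicit (closed-form) expressions for determinants of small
matrices in terms of the elements of the matrix. For $n = 1$,


\[ \det(A) = \det(a_{11}) = a_{11}. \tag{4.5} \]

For $n = 2$,

\[ \det(A) = \begin{vmatrix} a_{11} & a_{12} \\ a_{21} & a_{22} \end{vmatrix} = a_{11}a_{22} - a_{12}a_{21}, \tag{4.6} \]

which we have observed in the preceding example.
For $n = 3$ (known as Sarrus’ rule),


\begin{align}
\begin{vmatrix} a_{11} & a_{12} & a_{13} \\ a_{21} & a_{22} & a_{23} \\ a_{31} & a_{32} & a_{33} \end{vmatrix} &= a_{11}a_{22}a_{33} + a_{21}a_{32}a_{13} + a_{31}a_{12}a_{23} \tag{4.7} \\
&\quad - a_{31}a_{22}a_{13} - a_{11}a_{32}a_{23} - a_{21}a_{12}a_{33}.
\end{align}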
For a memory aid of the product terms in Sarrus’ rule, try tracing the
elements of the triple products in the matrix.

We call a square matrix $T$ an upper-triangular matrix if $T_{ij} = 0$ for
$i > j$, i.e., the matrix is zero below its diagonal. Analogously, we define a
lower-triangular matrix as a matrix with zeros above its diagonal. For a
triangular matrix $T \in \mathbb{R}^{n \times n}$, the determinant is the product of the diagonal
elements, i.e.,


\[ \det(T) = \prod_{i=1}^{n} T_{ii}. \tag{4.8} \]


Example 4.2 (Determinants as Measures of Volume)

The notion of a determinant is natural when we consider it as a mapping
from a set of $n$ vectors spanning an object in $\mathbb{R}^n$. It turns out that the
determinant $\det(A)$ is the signed volume of an $n$-dimensional parallelepiped
formed by columns of the matrix $A$.

For $n = 2$, the columns of the matrix form a parallelogram; see
Figure 4.2. As the angle between vectors gets smaller, the area of a
parallelogram shrinks, too. Consider two vectors $b, g$ that form the columns of a
matrix $A = [b, g]$. Then, the absolute value of the determinant of $A$ is the
area of the parallelogram with vertices $0, b, g, b + g$. In particular, if $b, g$
are linearly dependent so that $b = \lambda g$ for some $\lambda \in \mathbb{R}$, they no longer
form a two-dimensional parallelogram. Therefore, the corresponding area
is 0. On the contrary, if $b, g$ are linearly independent and are multiples of
the canonical basis vectors $e_1, e_2$, then they can be written as
$b = \begin{bmatrix} b \\ 0 \end{bmatrix}$ and $g = \begin{bmatrix} 0 \\ g \end{bmatrix}$, and the determinant is


\[ \begin{vmatrix} b & 0 \\ 0 & g \end{vmatrix} = bg - 0 = bg. \]


This becomes the familiar formula: area = height × length.

The sign of the determinant indicates the orientation of the spanning
vectors $b, g$ with respect to the standard basis $(e_1, e_2)$. In our figure,
flipping the order to $g, b$ swaps the columns of $A$ and reverses the orientation
of the shaded area. This intuition extends to higher dimensions. In $\mathbb{R}^3$, we consider
three vectors $r, b, g \in \mathbb{R}^3$ spanning the edges of a parallelepiped, i.e., a
solid with faces that are parallel parallelograms (see Figure 4.3). The
absolute value of the determinant of the $3 \times 3$ matrix $[r, b, g]$ is the volume
of the solid. Thus, the determinant acts as a function that measures the
signed volume formed by column vectors composed in a matrix.

Consider the three linearly independent vectors $r, g, b \in \mathbb{R}^3$ given as


\[ r = \begin{bmatrix} 2 \\ 0 \\ -8 \end{bmatrix}, \quad g = \begin{bmatrix} 6 \\ 1 \\ 0 \end{bmatrix}, \quad b = \begin{bmatrix} 1 \\ 4 \\ -1 \end{bmatrix}. \tag{4.9} \]


Writing these vectors as the columns of a matrix


\[ A = [r, g, b] = \begin{bmatrix} 2 & 6 & 1 \\ 0 & 1 & 4 \\ -8 & 0 & -1 \end{bmatrix} \tag{4.10} \]


allows us to compute the desired volume as


\[ V = |\det(A)| = 186. \tag{4.11} \]


Figure 4.2 The area of the parallelogram (shaded region) spanned by the vectors $b$ and $g$ is $|\det([b, g])|$.

Figure 4.3 The volume of the parallelepiped (shaded volume) spanned by vectors $r, b, g$ is $|\det([r, b, g])|$.


Computing the determinant of an $n \times n$ matrix requires a general
algorithm to solve the cases for $n > 3$, which we are going to explore in the
following. Theorem 4.2 below reduces the problem of computing the
determinant of an $n \times n$ matrix to computing the determinant of $(n-1) \times (n-1)$
matrices. By recursively applying the Laplace expansion (Theorem 4.2),
we can therefore compute determinants of $n \times n$ matrices by ultimately
computing determinants of $2 \times 2$ matrices.


Theorem 4.2 (Laplace Expansion). Consider a matrix $A \in \mathbb{R}^{n \times n}$. Then,
for all $j = 1, \ldots, n$:


1. Expansion along column $j$
\[ \det(A) = \sum_{k=1}^{n} (-1)^{k+j} a_{kj} \det(A_{k,j}). \tag{4.12} \]
2. Expansion along row $j$
\[ \det(A) = \sum_{k=1}^{n} (-1)^{k+j} a_{jk} \det(A_{j,k}). \tag{4.13} \]
Here $A_{k,j} \in \mathbb{R}^{(n-1) \times (n-1)}$ is the submatrix of $A$ that we obtain when
deleting row $k$ and column $j$. The quantity $\det(A_{k,j})$ is called a minor and
$(-1)^{k+j} \det(A_{k,j})$ a cofactor.


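Example 4.3 (Laplace Expansion)
Let us compute the determinant of


\[ A = \begin{bmatrix} 1 & 2 & 3 \\ 3 & 1 & 2 \\ 0 & 0 & 1 \end{bmatrix} \tag{4.14} \]

using the Laplace expansion along the first row. Applying (4.13) yields

\begin{align}
\begin{vmatrix} 1 & 2 & 3 \\ 3 & 1 & 2 \\ 0 & 0 & 1 \end{vmatrix} &= (-1)^{1+1} \cdot 1 \begin{vmatrix} 1 & 2 \\ 0 & 1 \end{vmatrix} \tag{4.15} \\
&\quad + (-1)^{1+2} \cdot 2 \begin{vmatrix} 3 & 2 \\ 0 & 1 \end{vmatrix} + (-1)^{1+3} \cdot 3 \begin{vmatrix} 3 & 1 \\ 0 & 0 \end{vmatrix}.
\end{align}

We use (4.6) to compute the determinants of all $2 \times 2$ matrices and obtain


\[ \det(A) = 1(1 - 0) - 2(3 - 0) + 3(0 - 0) = -5. \tag{4.16} \]


For completeness we can compare this result to computing the
determinant using Sarrus’ rule (4.7):


\[ \det(A) = 1 \cdot 1 \cdot 1 + 3 \cdot 0 \cdot 3 + 0 \cdot 2 \cdot 2 - 0 \cdot 1 \cdot 3 - 1 \cdot 0 \cdot 2 - 3 \cdot 2 \cdot 1 = 1 - 6 = -5. \tag{4.17} \]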
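
The recursion behind the Laplace expansion can be written down directly. The sketch below assumes NumPy and always expands along the first row, as in Example 4.3; `det_laplace` is an illustrative name. The recursion needs on the order of $n!$ operations, so it is meant only to mirror the theorem on small matrices, not as a practical algorithm.

```python
import numpy as np

def det_laplace(A):
    """Determinant via Laplace expansion along the first row, cf. (4.13)."""
    A = np.asarray(A, dtype=float)
    if A.ndim != 2 or A.shape[0] != A.shape[1]:
        raise ValueError("determinant is only defined for square matrices")
    n = A.shape[0]
    if n == 1:
        return A[0, 0]
    total = 0.0
    for k in range(n):
        # Submatrix obtained by deleting row 0 and column k.
        minor = np.delete(np.delete(A, 0, axis=0), k, axis=1)
        # With 0-based k the sign (-1)^(1 + (k+1)) reduces to (-1)^k.
        total += (-1) ** k * A[0, k] * det_laplace(minor)
    return total

A_example = np.array([[1, 2, 3],
                      [3, 1, 2],
                      [0, 0, 1]])
A_volume = np.array([[2, 6, 1],
                     [0, 1, 4],
                     [-8, 0, -1]])

det_example = det_laplace(A_example)      # matrix from Example 4.3
volume = abs(det_laplace(A_volume))       # matrix from Example 4.2
```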


For $A \in \mathbb{R}^{n \times n}$ the determinant exhibits the following properties:


= The determinant of a matrix product is the product of the corresponding
determinants, $\det(AB) = \det(A)\det(B)$.

= Determinants are invariant to transposition, i.e., $\det(A) = \det(A^\top)$.

= If $A$ is regular (invertible), then $\det(A^{-1}) = \frac{1}{\det(A)}$.

= Similar matrices (Definition 2.22) possess the same determinant.
Therefore, for a linear mapping $\Phi : V \to V$ all transformation matrices $A_\Phi$
of $\Phi$ have the same determinant. Thus, the determinant is invariant to
the choice of basis of a linear mapping.

= Adding a multiple of a column/row to another one does not change
$\det(A)$.

= Multiplication of a column/row with $\lambda \in \mathbb{R}$ scales $\det(A)$ by $\lambda$. In
particular, $\det(\lambda A) = \lambda^n \det(A)$.

= Swapping two rows/columns changes the sign of $\det(A)$.


Because of the last three properties, we can use Gaussian elimination (see
Section 2.1) to compute $\det(A)$ by bringing $A$ into row-echelon form.
We can stop Gaussian elimination when we have $A$ in a triangular form
where the elements below the diagonal are all 0. Recall from (4.8) that the
determinant of a triangular matrix is the product of the diagonal elements.
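
The procedure described above can be sketched as follows, assuming NumPy. The function `det_gauss` is an illustrative name. It uses only row additions (which leave the determinant unchanged) and row swaps (each of which flips the sign), then multiplies the diagonal of the resulting triangular matrix. Choosing the largest available pivot in each column is an added assumption for numerical stability and is not part of the text.

```python
import numpy as np

def det_gauss(A):
    """Determinant via Gaussian elimination to upper-triangular form."""
    U = np.array(A, dtype=float)          # work on a copy
    n = U.shape[0]
    sign = 1.0
    for col in range(n):
        # Pick the row with the largest entry in this column as pivot row.
        pivot = col + np.argmax(np.abs(U[col:, col]))
        if np.isclose(U[pivot, col], 0.0):
            return 0.0                    # no pivot available: det(A) = 0
        if pivot != col:
            U[[col, pivot]] = U[[pivot, col]]
            sign = -sign                  # a row swap changes the sign
        for row in range(col + 1, n):
            # Adding a multiple of one row to another keeps det unchanged.
            U[row] -= (U[row, col] / U[col, col]) * U[col]
    return sign * np.prod(np.diag(U))     # product of the diagonal, cf. (4.8)

A = np.array([[1, 2, 3],
              [3, 1, 2],
              [0, 0, 1]])
det_A = det_gauss(A)
det_singular = det_gauss([[1.0, 2.0], [2.0, 4.0]])
```

On the matrix from Example 4.3 this reproduces the value obtained by Laplace expansion, at a cost that grows only cubically in $n$.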


Theorem 4.3. A square matrix $A \in \mathbb{R}^{n \times n}$ has $\det(A) \neq 0$ if and only if
$\operatorname{rk}(A) = n$. In other words, $A$ is invertible if and only if it is full rank.


When mathematics was mainly performed by hand, the determinant 
calculation was considered an essential way to analyze matrix invertibil- 
ity. However, contemporary approaches in machine learning use direct 
numerical methods that superseded the explicit calculation of the deter- 
minant. For example, in Chapter 2, we learned that inverse matrices can 
be computed by Gaussian elimination. Gaussian elimination can thus be 
used to compute the determinant of a matrix. 

Determinants will play an important theoretical role for the following 
sections, especially when we learn about eigenvalues and eigenvectors 
(Section 4.2) through the characteristic polynomial. 


Definition 4.4. The trace of a square matrix $A \in \mathbb{R}^{n \times n}$ is defined as

\[ \operatorname{tr}(A) := \sum_{i=1}^{n} a_{ii}, \tag{4.18} \]


i.e., the trace is the sum of the diagonal elements of $A$.
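The trace satisfies the following properties:


= $\operatorname{tr}(A + B) = \operatorname{tr}(A) + \operatorname{tr}(B)$ for $A, B \in \mathbb{R}^{n \times n}$

= $\operatorname{tr}(\alpha A) = \alpha \operatorname{tr}(A)$, $\alpha \in \mathbb{R}$ for $A \in \mathbb{R}^{n \times n}$

= $\operatorname{tr}(I_n) = n$

= $\operatorname{tr}(AB) = \operatorname{tr}(BA)$ for $A \in \mathbb{R}^{n \times k}$, $B \in \mathbb{R}^{k \times n}$


It can be shown that only one function satisfies these four properties
together: the trace (Gohberg et al., 2012).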

The properties of the trace of matrix products are more general.
Specifically, the trace is invariant under cyclic permutations, i.e.,


\[ \operatorname{tr}(AKL) = \operatorname{tr}(KLA) \tag{4.19} \]


for matrices $A \in \mathbb{R}^{a \times k}$, $K \in \mathbb{R}^{k \times l}$, $L \in \mathbb{R}^{l \times a}$. This property generalizes to
products of an arbitrary number of matrices. As a special case of (4.19), it
follows that for two vectors $x, y \in \mathbb{R}^n$


\[ \operatorname{tr}(xy^\top) = \operatorname{tr}(y^\top x) = y^\top x \in \mathbb{R}. \tag{4.20} \]
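
The cyclic property (4.19) and its vector special case (4.20) can be checked on random matrices. The sketch below assumes NumPy; the shapes, the seed, and the matrices `M` and `S` are arbitrary illustrative choices. The last two lines also check that the trace is unchanged under a similarity transformation $S^{-1} M S$, which follows from the cyclic property.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((2, 3))
K = rng.standard_normal((3, 4))
L = rng.standard_normal((4, 2))

# Cyclic permutations give products of different sizes but equal trace.
t_AKL = np.trace(A @ K @ L)   # 2 x 2 product
t_KLA = np.trace(K @ L @ A)   # 3 x 3 product
t_LAK = np.trace(L @ A @ K)   # 4 x 4 product

# Special case (4.20): trace of an outer product equals the inner product.
x = rng.standard_normal(5)
y = rng.standard_normal(5)
t_outer = np.trace(np.outer(x, y))
inner = y @ x

# Similarity transformation with a fixed invertible S (det(S) = 1).
M = rng.standard_normal((2, 2))
S = np.array([[2.0, 1.0], [1.0, 1.0]])
t_similar = np.trace(np.linalg.inv(S) @ M @ S)
```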


Given a linear mapping $\Phi : V \to V$, where $V$ is a vector space, we
define the trace of this map by using the trace of matrix representation
of $\Phi$. For a given basis of $V$, we can describe $\Phi$ by means of the
transformation matrix $A$. Then the trace of $\Phi$ is the trace of $A$. For a different
basis of $V$, it holds that the corresponding transformation matrix $B$ of $\Phi$
can be obtained by a basis change of the form $S^{-1}AS$ for suitable $S$ (see
Section 2.7.2). For the corresponding trace of $\Phi$, this means


\[ \operatorname{tr}(B) = \operatorname{tr}(S^{-1}AS) \overset{(4.19)}{=} \operatorname{tr}(ASS^{-1}) = \operatorname{tr}(A). \tag{4.21} \]


Hence, while matrix representations of linear mappings are basis
dependent, the trace of a linear mapping $\Phi$ is independent of the basis.

In this section, we covered determinants and traces as functions char- 
acterizing a square matrix. Taking together our understanding of determi- 
nants and traces we can now define an important equation describing a 
matrix A in terms of a polynomial, which we will use extensively in the 
following sections. 


Definition 4.5 (Characteristic Polynomial). For $\lambda \in \mathbb{R}$ and a square
matrix $A \in \mathbb{R}^{n \times n}$


\begin{align}
p_A(\lambda) &:= \det(A - \lambda I) \tag{4.22a} \\
&= c_0 + c_1 \lambda + c_2 \lambda^2 + \cdots + c_{n-1} \lambda^{n-1} + (-1)^n \lambda^n, \tag{4.22b}
\end{align}

$c_0, \ldots, c_{n-1} \in \mathbb{R}$, is the characteristic polynomial of $A$. In particular,

\begin{align}
c_0 &= \det(A), \tag{4.23} \\
c_{n-1} &= (-1)^{n-1} \operatorname{tr}(A). \tag{4.24}
\end{align}


The characteristic polynomial (4.22a) will allow us to compute
eigenvalues and eigenvectors, covered in the next section.
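
The relations (4.23) and (4.24) can be verified numerically. The sketch below assumes NumPy and uses `np.poly`, which returns the coefficients of the monic polynomial $\det(\lambda I - A)$ with the highest power first. Since the definition above uses $\det(A - \lambda I)$, the coefficients are multiplied by $(-1)^n$ and reversed; the helper name `char_poly_coeffs` and the test matrix are illustrative choices.

```python
import numpy as np

def char_poly_coeffs(A):
    """Coefficients [c_0, ..., c_n] of p_A(lambda) = det(A - lambda I)."""
    A = np.asarray(A, dtype=float)
    n = A.shape[0]
    monic = np.poly(A)               # det(lambda I - A), highest power first
    return (-1) ** n * monic[::-1]   # ascending powers, sign as in (4.22a)

A = np.array([[1.0, 2.0, 3.0],
              [3.0, 1.0, 2.0],
              [0.0, 0.0, 1.0]])
n = A.shape[0]
c = char_poly_coeffs(A)

c0 = c[0]            # should equal det(A), cf. (4.23)
c_n_minus_1 = c[n - 1]   # should equal (-1)^(n-1) tr(A), cf. (4.24)
leading = c[n]       # should equal (-1)^n, cf. (4.22b)
```

Note that `np.poly` works from computed eigenvalues, so the coefficients are only accurate up to floating-point error.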


4.2 Eigenvalues and Eigenvectors 


We will now get to know a new way to characterize a matrix and its associ- 
ated linear mapping. Recall from Section 2.7.1 that every linear mapping 
has a unique transformation matrix given an ordered basis. We can in- 
terpret linear mappings and their associated transformation matrices by 
performing an “eigen” analysis. As we will see, the eigenvalues of a lin- 
ear mapping will tell us how a special set of vectors, the eigenvectors, is 
transformed by the linear mapping. 


Definition 4.6. Let $A \in \mathbb{R}^{n \times n}$ be a square matrix. Then $\lambda \in \mathbb{R}$ is an
eigenvalue of $A$ and $x \in \mathbb{R}^n \setminus \{0\}$ is the corresponding eigenvector of $A$ if


\[ Ax = \lambda x. \tag{4.25} \]

We call (4.25) the eigenvalue equation.


Remark. In the linear algebra literature and software, it is often a conven- 
tion that eigenvalues are sorted in descending order, so that the largest 
eigenvalue and associated eigenvector are called the first eigenvalue and 
its associated eigenvector, and the second largest called the second eigen- 
value and its associated eigenvector, and so on. However, textbooks and 
publications may have different or no notion of orderings. We do not want 
to presume an ordering in this book if not stated explicitly.


The following statements are equivalent:


= $\lambda$ is an eigenvalue of $A \in \mathbb{R}^{n \times n}$.

= There exists an $x \in \mathbb{R}^n \setminus \{0\}$ with $Ax = \lambda x$, or equivalently,
$(A - \lambda I_n)x = 0$ can be solved non-trivially, i.e., $x \neq 0$.

= $\operatorname{rk}(A - \lambda I_n) < n$.

= $\det(A - \lambda I_n) = 0$.


Definition 4.7 (Collinearity and Codirection). Two vectors that point in 
the same direction are called codirected. Two vectors are collinear if they 
point in the same or the opposite direction. 


Remark (Non-uniqueness of eigenvectors). If $x$ is an eigenvector of $A$
associated with eigenvalue $\lambda$, then for any $c \in \mathbb{R} \setminus \{0\}$ it holds that $cx$ is
an eigenvector of $A$ with the same eigenvalue since


\[ A(cx) = cAx = c\lambda x = \lambda(cx). \tag{4.26} \]

Thus, all vectors that are collinear to $x$ are also eigenvectors of $A$.
Theorem 4.8. $\lambda \in \mathbb{R}$ is an eigenvalue of $A \in \mathbb{R}^{n \times n}$ if and only if $\lambda$ is a
root of the characteristic polynomial $p_A(\lambda)$ of $A$.


Definition 4.9. Let a square matrix $A$ have an eigenvalue $\lambda_i$. The algebraic
multiplicity of $\lambda_i$ is the number of times the root appears in the
characteristic polynomial.


Definition 4.10 (Eigenspace and Eigenspectrum). For $A \in \mathbb{R}^{n \times n}$, the set
of all eigenvectors of $A$ associated with an eigenvalue $\lambda$ spans a subspace
of $\mathbb{R}^n$, which is called the eigenspace of $A$ with respect to $\lambda$ and is denoted
by $E_\lambda$. The set of all eigenvalues of $A$ is called the eigenspectrum, or just
spectrum, of $A$.


If $\lambda$ is an eigenvalue of $A \in \mathbb{R}^{n \times n}$, then the corresponding eigenspace
$E_\lambda$ is the solution space of the homogeneous system of linear equations
$(A - \lambda I)x = 0$. Geometrically, the eigenvector corresponding to a nonzero
eigenvalue points in a direction that is stretched by the linear mapping.
The eigenvalue is the factor by which it is stretched. If the eigenvalue is
negative, the direction of the stretching is flipped.


Example 4.4 (The Case of the Identity Matrix)

The identity matrix $I \in \mathbb{R}^{n \times n}$ has characteristic polynomial $p_I(\lambda) =
\det(I - \lambda I) = (1 - \lambda)^n = 0$, which has only one eigenvalue $\lambda = 1$ that
occurs $n$ times. Moreover, $Ix = \lambda x = 1x$ holds for all vectors $x \in \mathbb{R}^n \setminus \{0\}$.
Because of this, the sole eigenspace $E_1$ of the identity matrix spans $n$
dimensions, and all $n$ standard basis vectors of $\mathbb{R}^n$ are eigenvectors of $I$.


Useful properties regarding eigenvalues and eigenvectors include the 
following: 


= A matrix $A$ and its transpose $A^\top$ possess the same eigenvalues, but not
necessarily the same eigenvectors.


= The eigenspace $E_\lambda$ is the null space of $A - \lambda I$ since


\begin{align}
Ax = \lambda x &\iff Ax - \lambda x = 0 \tag{4.27a} \\
&\iff (A - \lambda I)x = 0 \iff x \in \ker(A - \lambda I). \tag{4.27b}
\end{align}


= Similar matrices (see Definition 2.22) possess the same eigenvalues.
Therefore, a linear mapping $\Phi$ has eigenvalues that are independent of
the choice of basis of its transformation matrix. This makes eigenvalues,
together with the determinant and the trace, key characteristic
parameters of a linear mapping as they are all invariant under basis change.


= Symmetric, positive definite matrices always have positive, real
eigenvalues.
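Example 4.5 (Computing Eigenvalues, Eigenvectors, and
Eigenspaces)

Let us find the eigenvalues and eigenvectors of the $2 \times 2$ matrix

\[ A = \begin{bmatrix} 4 & 2 \\ 1 & 3 \end{bmatrix}. \tag{4.28} \]

Step 1: Characteristic Polynomial. From our definition of the
eigenvector $x \neq 0$ and eigenvalue $\lambda$ of $A$, there will be a vector such that
$Ax = \lambda x$, i.e., $(A - \lambda I)x = 0$. Since $x \neq 0$, this requires that the kernel
(null space) of $A - \lambda I$ contains more elements than just 0. This means
that $A - \lambda I$ is not invertible and therefore $\det(A - \lambda I) = 0$. Hence, we
need to compute the roots of the characteristic polynomial (4.22a) to find
the eigenvalues.
Step 2: Eigenvalues. The characteristic polynomial is


\begin{align}
p_A(\lambda) &= \det(A - \lambda I) \tag{4.29a} \\
&= \det\left( \begin{bmatrix} 4 & 2 \\ 1 & 3 \end{bmatrix} - \begin{bmatrix} \lambda & 0 \\ 0 & \lambda \end{bmatrix} \right) = \begin{vmatrix} 4 - \lambda & 2 \\ 1 & 3 - \lambda \end{vmatrix} \tag{4.29b} \\
&= (4 - \lambda)(3 - \lambda) - 2 \cdot 1. \tag{4.29c}
\end{align}


We factorize the characteristic polynomial and obtain

\[ p(\lambda) = (4 - \lambda)(3 - \lambda) - 2 \cdot 1 = 10 - 7\lambda + \lambda^2 = (2 - \lambda)(5 - \lambda) \tag{4.30} \]


giving the roots $\lambda_1 = 2$ and $\lambda_2 = 5$.
Step 3: Eigenvectors and Eigenspaces. We find the eigenvectors that
correspond to these eigenvalues by looking at vectors $x$ such that


\[ \begin{bmatrix} 4 - \lambda & 2 \\ 1 & 3 - \lambda \end{bmatrix} x = 0. \tag{4.31} \]


For $\lambda = 5$ we obtain

\[ \begin{bmatrix} 4 - 5 & 2 \\ 1 & 3 - 5 \end{bmatrix} \begin{bmatrix} x_1 \\ x_2 \end{bmatrix} = \begin{bmatrix} -1 & 2 \\ 1 & -2 \end{bmatrix} \begin{bmatrix} x_1 \\ x_2 \end{bmatrix} = 0. \tag{4.32} \]

We solve this homogeneous system and obtain a solution space

\[ E_5 = \operatorname{span}\left[ \begin{bmatrix} 2 \\ 1 \end{bmatrix} \right]. \tag{4.33} \]

This eigenspace is one-dimensional as it possesses a single basis vector.


Analogously, we find the eigenvector for $\lambda = 2$ by solving the
homogeneous system of equations


\[ \begin{bmatrix} 4 - 2 & 2 \\ 1 & 3 - 2 \end{bmatrix} x = \begin{bmatrix} 2 & 2 \\ 1 & 1 \end{bmatrix} x = 0. \tag{4.34} \]


This means any vector $x = \begin{bmatrix} x_1 \\ x_2 \end{bmatrix}$ where $x_2 = -x_1$, such as $\begin{bmatrix} 1 \\ -1 \end{bmatrix}$, is an
eigenvector with eigenvalue 2. The corresponding eigenspace is given as


\[ E_2 = \operatorname{span}\left[ \begin{bmatrix} 1 \\ -1 \end{bmatrix} \right]. \tag{4.35} \]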
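
The hand computation in Example 4.5 can be cross-checked with a numerical eigensolver. The sketch below assumes NumPy and `np.linalg.eig`, which returns eigenvectors normalized to unit length and in no guaranteed order, so the results are sorted and compared with the hand-computed basis vectors only up to scaling. The helper `collinear` is an illustrative name.

```python
import numpy as np

A = np.array([[4.0, 2.0],
              [1.0, 3.0]])

eigvals, eigvecs = np.linalg.eig(A)      # columns of eigvecs are eigenvectors
order = np.argsort(eigvals)
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

def collinear(u, v):
    """True if u and v point in the same or the opposite direction."""
    return np.isclose(abs(u @ v), np.linalg.norm(u) * np.linalg.norm(v))

# Residuals of the eigenvalue equation A x = lambda x for each pair.
residuals = [np.linalg.norm(A @ eigvecs[:, k] - eigvals[k] * eigvecs[:, k])
             for k in range(2)]

matches_E2 = collinear(eigvecs[:, 0], np.array([1.0, -1.0]))
matches_E5 = collinear(eigvecs[:, 1], np.array([2.0, 1.0]))
```

The numerical eigenvectors differ from $[1, -1]^\top$ and $[2, 1]^\top$ only by a scalar factor, which is consistent with the non-uniqueness remark around (4.26).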


The two eigenspaces $E_5$ and $E_2$ in Example 4.5 are one-dimensional
as they are each spanned by a single vector. However, in other cases
we may have multiple identical eigenvalues (see Definition 4.9) and the
eigenspace may have more than one dimension.


Definition 4.11. Let $\lambda_i$ be an eigenvalue of a square matrix $A$. Then the
geometric multiplicity of $\lambda_i$ is the number of linearly independent
eigenvectors associated with $\lambda_i$. In other words, it is the dimensionality of the
eigenspace spanned by the eigenvectors associated with $\lambda_i$.


Remark. A specific eigenvalue’s geometric multiplicity must be at least
one because every eigenvalue has at least one associated eigenvector. An
eigenvalue’s geometric multiplicity cannot exceed its algebraic
multiplicity, but it may be lower.


Example 4.6

The matrix $A = \begin{bmatrix} 2 & 1 \\ 0 & 2 \end{bmatrix}$ has two repeated eigenvalues $\lambda_1 = \lambda_2 = 2$ and an
algebraic multiplicity of 2. The eigenvalue has, however, only one distinct
unit eigenvector $x_1 = \begin{bmatrix} 1 \\ 0 \end{bmatrix}$ and, thus, geometric multiplicity 1.
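
Since the eigenspace $E_\lambda$ is the null space of $A - \lambda I$, its dimension equals $n - \operatorname{rk}(A - \lambda I)$. The sketch below uses this to compute geometric multiplicities with NumPy; the function name and the tolerance `tol` are illustrative assumptions, and rank decisions in floating point depend on that tolerance.

```python
import numpy as np

def geometric_multiplicity(A, lam, tol=1e-10):
    """Dimension of the eigenspace of A for eigenvalue lam: n - rk(A - lam I)."""
    A = np.asarray(A, dtype=float)
    n = A.shape[0]
    return n - np.linalg.matrix_rank(A - lam * np.eye(n), tol=tol)

# Matrix from Example 4.6: eigenvalue 2 with algebraic multiplicity 2.
A_defective = np.array([[2.0, 1.0],
                        [0.0, 2.0]])
gm_defective = geometric_multiplicity(A_defective, 2.0)

# Identity matrix from Example 4.4: eigenvalue 1, eigenspace is all of R^n.
gm_identity = geometric_multiplicity(np.eye(3), 1.0)
```

For the matrix of Example 4.6 the geometric multiplicity is 1, strictly below the algebraic multiplicity of 2, whereas for the identity matrix the two coincide.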


Graphical Intuition in Two Dimensions 


Let us gain some intuition for determinants, eigenvectors, and eigenvalues
using different linear mappings. Figure 4.4 depicts five transformation
matrices $A_1, \ldots, A_5$ and their impact on a square grid of points, centered
at the origin:

= $A_1 = \begin{bmatrix} \frac{1}{2} & 0 \\ 0 & 2 \end{bmatrix}$. The direction of the two eigenvectors correspond to the
canonical basis vectors in $\mathbb{R}^2$, i.e., to two cardinal axes. The vertical axis
is extended by a factor of 2 (eigenvalue $\lambda_1 = 2$), and the horizontal axis
is compressed by factor $\frac{1}{2}$ (eigenvalue $\lambda_2 = \frac{1}{2}$). The mapping is area
preserving ($\det(A_1) = 1 = 2 \cdot \frac{1}{2}$).

= $A_2 = \begin{bmatrix} 1 & \frac{1}{2} \\ 0 & 1 \end{bmatrix}$ corresponds to a shearing mapping, i.e., it shears the
points along the horizontal axis to the right if they are on the positive

Draft (2021-01-14) of “Mathematics for Machine Learning”. Feedback: https://mml-book.com.




Figure 4.4 Determinants and eigenspaces. Overview of five linear mappings and their associated transformation matrices $A_i \in \mathbb{R}^{2 \times 2}$ projecting 400 color-coded points $x \in \mathbb{R}^2$ (left column) onto target points $A_i x$ (right column). The central column depicts the first eigenvector, stretched by its associated eigenvalue $\lambda_1$, and the second eigenvector stretched by its eigenvalue $\lambda_2$. Each row depicts the effect of one of five transformation matrices $A_i$ with respect to the standard basis.

Panel labels, one set per mapping: $\lambda_1 = 2.0$, $\lambda_2 = 0.5$, $\det(A) = 1.0$; $\lambda_1 = 1.0$, $\lambda_2 = 1.0$, $\det(A) = 1.0$; $\lambda_1 = 0.87 - 0.5j$, $\lambda_2 = 0.87 + 0.5j$, $\det(A) = 1.0$; $\lambda_1 = 0.0$, $\lambda_2 = 2.0$, $\det(A) = 0.0$; $\lambda_1 = 0.5$, $\lambda_2 = 1.5$, $\det(A) = 0.75$.


half of the vertical axis, and to the left vice versa. This mapping is area 
preserving (det(A,) = 1). The eigenvalue à; = 1 = A is repeated 
and the eigenvectors are collinear (drawn here for emphasis in two 
opposite directions). This indicates that the mapping acts only along 
one direction (the horizontal axis). 


■ $A_3 = \begin{bmatrix} \cos(\frac{\pi}{6}) & -\sin(\frac{\pi}{6}) \\ \sin(\frac{\pi}{6}) & \cos(\frac{\pi}{6}) \end{bmatrix} = \frac{1}{2} \begin{bmatrix} \sqrt{3} & -1 \\ 1 & \sqrt{3} \end{bmatrix}$.
  The matrix $A_3$ rotates the points by $\frac{\pi}{6}$ rad $= 30^\circ$
  counter-clockwise and has only complex eigenvalues, reflecting that the
  mapping is a rotation (hence, no eigenvectors are drawn). A rotation has to
  be volume preserving, and so the determinant is 1. For more details on
  rotations, we refer to Section 3.9.

■ $A_4 = \begin{bmatrix} 1 & -1 \\ -1 & 1 \end{bmatrix}$ represents a mapping in
  the standard basis that collapses a two-dimensional domain onto one
  dimension. Since one eigenvalue is 0, the space in direction of the (blue)
  eigenvector corresponding to $\lambda_1 = 0$ collapses, while the orthogonal
  (red) eigenvector stretches space by a factor $\lambda_2 = 2$. Therefore, the
  area of the image is 0.

■ $A_5 = \begin{bmatrix} 1 & \frac{1}{2} \\ \frac{1}{2} & 1 \end{bmatrix}$ is a
  shear-and-stretch mapping that scales space by 75% since
  $|\det(A_5)| = \frac{3}{4}$. It stretches space along the (red) eigenvector
  of $\lambda_2$ by a factor 1.5 and compresses it along the orthogonal (blue)
  eigenvector by a factor 0.5.

Figure 4.5 Caenorhabditis elegans neural network (Kaiser and Hilgetag, 2006).
(a) Symmetrized connectivity matrix; (b) Eigenspectrum.
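The eigenvalues and determinants quoted for the five mappings can be reproduced with a few lines of NumPy. This is only a sketch: the dictionary `matrices` and the variable `summary` are illustrative names, and the entries of the matrices are taken from the descriptions above.

```python
import numpy as np

c, s = np.cos(np.pi / 6), np.sin(np.pi / 6)
matrices = {
    "A1": np.array([[0.5, 0.0], [0.0, 2.0]]),    # axis-aligned stretch/compress
    "A2": np.array([[1.0, 0.5], [0.0, 1.0]]),    # horizontal shear
    "A3": np.array([[c, -s], [s, c]]),           # rotation by 30 degrees
    "A4": np.array([[1.0, -1.0], [-1.0, 1.0]]),  # collapse onto one dimension
    "A5": np.array([[1.0, 0.5], [0.5, 1.0]]),    # shear-and-stretch
}

# For every mapping, store its eigenvalues and its determinant.
summary = {name: (np.linalg.eigvals(M), np.linalg.det(M))
           for name, M in matrices.items()}

for name, (eigs, det) in summary.items():
    print(name, np.round(eigs, 2), round(float(det), 2))
```

Note that the rotation is the only mapping for which `np.linalg.eigvals` returns a complex array, which matches the absence of real eigenvectors in the corresponding row of the figure.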


Example 4.7 (Eigenspectrum of a Biological Neural Network) 





[Figure 4.5: (a) Connectivity matrix (axes: neuron index vs. neuron index).
(b) Eigenspectrum (axes: index of sorted eigenvalue vs. eigenvalue).]


Methods to analyze and learn from network data are an essential com- 
ponent of machine learning methods. The key to understanding networks 
is the connectivity between network nodes, especially if two nodes are 
connected to each other or not. In data science applications, it is often 
useful to study the matrix that captures this connectivity data. 

We build a connectivity/adjacency matrix $A \in \mathbb{R}^{277 \times 277}$ of
the complete neural network of the worm C. elegans. Each row/column represents
one of the 277 neurons of this worm's brain. The connectivity matrix $A$ has
a value of $a_{ij} = 1$ if neuron $i$ talks to neuron $j$ through a synapse, and
$a_{ij} = 0$ otherwise. The connectivity matrix is not symmetric, which implies
that eigenvalues may not be real valued. Therefore, we compute a
symmetrized version of the connectivity matrix as $A_{\text{sym}} := A + A^\top$.
This new matrix $A_{\text{sym}}$ is shown in Figure 4.5(a) and has a nonzero value
$a_{ij}$ if and only if two neurons are connected (white pixels), irrespective
of the direction of the connection. In Figure 4.5(b), we show the
corresponding eigenspectrum of $A_{\text{sym}}$. The horizontal axis shows the
index of the eigenvalues, sorted in descending order. The vertical axis shows
the corresponding eigenvalue. The S-like shape of this eigenspectrum is typical
for many biological neural networks. The underlying mechanism responsible
for this is an area of active neuroscience research.
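The symmetrization and the sorted eigenspectrum can be sketched as follows. The C. elegans data are not reproduced here, so the example assumes a small random directed graph as a stand-in; the size `n` and the connection probability are hypothetical values chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 50  # hypothetical network size (the worm's network has 277 neurons)

# Random directed adjacency matrix: a_ij = 1 if node i connects to node j.
A = (rng.random((n, n)) < 0.1).astype(float)
np.fill_diagonal(A, 0.0)

# Symmetrize as in the text: A_sym = A + A^T.
A_sym = A + A.T

# eigvalsh exploits symmetry and returns real eigenvalues in ascending order;
# reversing gives the descending order used on the horizontal axis of the plot.
eigenspectrum = np.linalg.eigvalsh(A_sym)[::-1]
```

Plotting `eigenspectrum` against its index gives the kind of curve shown in panel (b); a random graph is not expected to reproduce the S-like shape of the biological network.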
Theorem 4.12. The eigenvectors $x_1, \ldots, x_n$ of a matrix
$A \in \mathbb{R}^{n \times n}$ with $n$ distinct eigenvalues
$\lambda_1, \ldots, \lambda_n$ are linearly independent.


This theorem states that eigenvectors of a matrix with $n$ distinct
eigenvalues form a basis of $\mathbb{R}^n$.


Definition 4.13. A square matrix $A \in \mathbb{R}^{n \times n}$ is defective if
it possesses fewer than $n$ linearly independent eigenvectors.


A non-defective matrix $A \in \mathbb{R}^{n \times n}$ does not necessarily
require $n$ distinct eigenvalues, but it does require that the eigenvectors
form a basis of $\mathbb{R}^n$. Looking at the eigenspaces of a defective matrix,
it follows that the sum of the dimensions of the eigenspaces is less than $n$.
Specifically, a defective matrix has at least one eigenvalue $\lambda_i$ with an
algebraic multiplicity $m > 1$ and a geometric multiplicity of less than $m$.


Remark. A defective matrix cannot have $n$ distinct eigenvalues, as distinct
eigenvalues have linearly independent eigenvectors (Theorem 4.12). ♢


Theorem 4.14. Given a matrix $A \in \mathbb{R}^{m \times n}$, we can always obtain
a symmetric, positive semidefinite matrix $S \in \mathbb{R}^{n \times n}$ by
defining

$$S := A^\top A. \tag{4.36}$$

Remark. If $\mathrm{rk}(A) = n$, then $S := A^\top A$ is symmetric, positive
definite. ♢


Understanding why Theorem 4.14 holds is insightful for how we can
use symmetrized matrices: Symmetry requires $S = S^\top$, and by inserting
(4.36) we obtain $S = A^\top A = A^\top (A^\top)^\top = (A^\top A)^\top = S^\top$.
Moreover, positive semidefiniteness (Section 3.2.3) requires that
$x^\top S x \geq 0$ and inserting (4.36) we obtain
$x^\top S x = x^\top A^\top A x = (x^\top A^\top)(Ax) = (Ax)^\top (Ax) \geq 0$,
because the dot product computes a sum of squares
(which are themselves non-negative).
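The two properties in this argument are easy to check numerically. The sketch below assumes an arbitrary random rectangular matrix; the shapes and the random seed are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((5, 3))   # arbitrary rectangular matrix, m = 5, n = 3
S = A.T @ A                       # S := A^T A, shape (3, 3)

# Symmetry: S equals its transpose.
is_symmetric = np.allclose(S, S.T)

# x^T S x equals ||Ax||^2, a sum of squares, for any x.
x = rng.standard_normal(3)
quad_form = x @ S @ x
norm_sq = np.sum((A @ x) ** 2)

# Eigenvalues of a symmetric, positive semidefinite matrix are nonnegative
# (up to floating-point round-off).
min_eig = np.linalg.eigvalsh(S).min()
```

Because a random $5 \times 3$ matrix has full column rank with probability one, `min_eig` is strictly positive here, in line with the remark on the positive definite case.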


Theorem 4.15 (Spectral Theorem). If $A \in \mathbb{R}^{n \times n}$ is symmetric,
there exists an orthonormal basis of the corresponding vector space $V$
consisting of eigenvectors of $A$, and each eigenvalue is real.


A direct implication of the spectral theorem is that the eigendecomposition
of a symmetric matrix $A$ exists (with real eigenvalues), and that
we can find an ONB of eigenvectors so that $A = P D P^\top$, where $D$ is
diagonal and the columns of $P$ contain the eigenvectors.


Example 4.8 
Consider the matrix

$$A = \begin{bmatrix} 3 & 2 & 2 \\ 2 & 3 & 2 \\ 2 & 2 & 3 \end{bmatrix}. \tag{4.37}$$

The characteristic polynomial of $A$ is

$$p_A(\lambda) = -(\lambda - 1)^2 (\lambda - 7), \tag{4.38}$$


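so that we obtain the eigenvalues $\lambda_1 = 1$ and $\lambda_2 = 7$, where
$\lambda_1$ is a repeated eigenvalue. Following our standard procedure for
computing eigenvectors, we obtain the eigenspaces

$$E_1 = \mathrm{span}\Big[\underbrace{\begin{bmatrix} -1 \\ 1 \\ 0 \end{bmatrix}}_{=:x_1}, \underbrace{\begin{bmatrix} -1 \\ 0 \\ 1 \end{bmatrix}}_{=:x_2}\Big], \quad E_7 = \mathrm{span}\Big[\underbrace{\begin{bmatrix} 1 \\ 1 \\ 1 \end{bmatrix}}_{=:x_3}\Big]. \tag{4.39}$$

We see that $x_3$ is orthogonal to both $x_1$ and $x_2$. However, since
$x_1^\top x_2 = 1 \neq 0$, they are not orthogonal. The spectral theorem
(Theorem 4.15) states that there exists an orthogonal basis, but the one we
have is not orthogonal. However, we can construct one.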

To construct such a basis, we exploit the fact that $x_1, x_2$ are
eigenvectors associated with the same eigenvalue $\lambda$. Therefore, for any
$\alpha, \beta \in \mathbb{R}$ it holds that

$$A(\alpha x_1 + \beta x_2) = A x_1 \alpha + A x_2 \beta = \lambda(\alpha x_1 + \beta x_2), \tag{4.40}$$

i.e., any linear combination of $x_1$ and $x_2$ is also an eigenvector of $A$
associated with $\lambda$. The Gram-Schmidt algorithm (Section 3.8.3) is a method
for iteratively constructing an orthogonal/orthonormal basis from a set of
basis vectors using such linear combinations. Therefore, even if $x_1$ and
$x_2$ are not orthogonal, we can apply the Gram-Schmidt algorithm and find
eigenvectors associated with $\lambda_1 = 1$ that are orthogonal to each other
(and to $x_3$). In our example, we will obtain

$$x_1' = \begin{bmatrix} -1 \\ 1 \\ 0 \end{bmatrix}, \quad x_2' = \frac{1}{2} \begin{bmatrix} -1 \\ -1 \\ 2 \end{bmatrix}, \tag{4.41}$$

which are orthogonal to each other, orthogonal to $x_3$, and eigenvectors of
$A$ associated with $\lambda_1 = 1$.
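The Gram-Schmidt step of this example can be written out directly. The following sketch assumes the matrix and eigenvectors stated in the example and uses illustrative variable names (`x2_orth`, `P`, `D`); it also checks that the resulting orthonormal eigenbasis reproduces $A = P D P^\top$.

```python
import numpy as np

A = np.array([[3.0, 2.0, 2.0],
              [2.0, 3.0, 2.0],
              [2.0, 2.0, 3.0]])
x1 = np.array([-1.0, 1.0, 0.0])   # eigenvector for eigenvalue 1
x2 = np.array([-1.0, 0.0, 1.0])   # eigenvector for eigenvalue 1
x3 = np.array([1.0, 1.0, 1.0])    # eigenvector for eigenvalue 7

# One Gram-Schmidt step: remove from x2 its component along x1.
x2_orth = x2 - (x2 @ x1) / (x1 @ x1) * x1   # equals 0.5 * [-1, -1, 2]

# x2_orth is a linear combination of x1 and x2, hence still an eigenvector
# for eigenvalue 1. Normalizing all three vectors yields an orthonormal basis.
P = np.column_stack([v / np.linalg.norm(v) for v in (x1, x2_orth, x3)])
D = np.diag([1.0, 1.0, 7.0])
```

In practice, a routine such as `np.linalg.eigh` returns an orthonormal eigenbasis for symmetric matrices directly, so the manual step is mainly of didactic value.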


Before we conclude our considerations of eigenvalues and eigenvectors 
it is useful to tie these matrix characteristics together with the concepts of 
the determinant and the trace. 


Theorem 4.16. The determinant of a matrix $A \in \mathbb{R}^{n \times n}$ is the
product of its eigenvalues, i.e.,

$$\det(A) = \prod_{i=1}^{n} \lambda_i, \tag{4.42}$$

where $\lambda_i \in \mathbb{C}$ are (possibly repeated) eigenvalues of $A$.
[Figure 4.6: the eigenvectors $x_1, x_2$ are mapped by $A$ to $v_1, v_2$.]


Theorem 4.17. The trace of a matrix $A \in \mathbb{R}^{n \times n}$ is the sum of
its eigenvalues, i.e.,

$$\mathrm{tr}(A) = \sum_{i=1}^{n} \lambda_i, \tag{4.43}$$

where $\lambda_i \in \mathbb{C}$ are (possibly repeated) eigenvalues of $A$.
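Theorems 4.16 and 4.17 can be verified numerically on any square matrix. The sketch below assumes a random real $4 \times 4$ matrix; its eigenvalues may be complex, but they occur in conjugate pairs, so their product and sum are real up to round-off.

```python
import numpy as np

rng = np.random.default_rng(2)
A = rng.standard_normal((4, 4))      # arbitrary real square matrix

eigenvalues = np.linalg.eigvals(A)   # possibly complex, in conjugate pairs

det_from_eigs = np.prod(eigenvalues)     # product of eigenvalues, cf. (4.42)
trace_from_eigs = np.sum(eigenvalues)    # sum of eigenvalues, cf. (4.43)
```

Comparing `det_from_eigs` with `np.linalg.det(A)` and `trace_from_eigs` with `np.trace(A)` shows agreement up to floating-point error.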


Let us provide a geometric intuition of these two theorems. Consider
a matrix $A \in \mathbb{R}^{2 \times 2}$ that possesses two linearly independent
eigenvectors $x_1, x_2$. For this example, we assume $(x_1, x_2)$ are an ONB of
$\mathbb{R}^2$ so that they are orthogonal and the area of the square they span
is 1; see Figure 4.6. From Section 4.1, we know that the determinant computes
the change of area of the unit square under the transformation $A$. In this
example, we can compute the change of area explicitly: Mapping the
eigenvectors using $A$ gives us vectors $v_1 = A x_1 = \lambda_1 x_1$ and
$v_2 = A x_2 = \lambda_2 x_2$, i.e., the new vectors $v_i$ are scaled versions of
the eigenvectors $x_i$, and the scaling factors are the corresponding
eigenvalues $\lambda_i$. The vectors $v_1, v_2$ are still orthogonal, and the
area of the rectangle they span is $|\lambda_1 \lambda_2|$.

Given that $x_1, x_2$ (in our example) are orthonormal, we can directly
compute the perimeter of the unit square as $2(1 + 1)$. Mapping the
eigenvectors using $A$ creates a rectangle whose perimeter is
$2(|\lambda_1| + |\lambda_2|)$. Therefore, the sum of the absolute values of the
eigenvalues tells us how the perimeter of the unit square changes under the
transformation matrix $A$.


Example 4.9 (Google’s PageRank — Webpages as Eigenvectors) 

Google uses the eigenvector corresponding to the maximal eigenvalue of 
a matrix A to determine the rank of a page for search. The idea for the 
PageRank algorithm, developed at Stanford University by Larry Page and 
Sergey Brin in 1996, was that the importance of any web page can be ap- 
proximated by the importance of pages that link to it. For this, they write 
down all web sites as a huge directed graph that shows which page links 
to which. PageRank computes the weight (importance) $x_i \geq 0$ of a web
site $a_i$ by counting the number of pages pointing to $a_i$. Moreover,
PageRank takes into account the importance of the web sites that link to
$a_i$. The navigation behavior of a user is then modeled by a transition
matrix $A$ of this graph that tells us with what (click) probability somebody
will end up on a different web site. The matrix $A$ has the property that for
any initial rank/importance vector $x$ of a web site the sequence
$x, Ax, A^2 x, \ldots$ converges to a vector $x^*$. This vector is called the
PageRank and satisfies $A x^* = x^*$, i.e., it is an eigenvector (with
corresponding eigenvalue 1) of $A$. After normalizing $x^*$, such that
$\|x^*\| = 1$, we can interpret the entries as probabilities. More details and
different perspectives on PageRank can be found in the original technical
report (Page et al., 1999).
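The iteration $x, Ax, A^2 x, \ldots$ can be illustrated on a toy example. The sketch below is not Google's implementation: it assumes a hypothetical four-page web with a hand-picked column-stochastic transition matrix, and it normalizes the vector so that its entries sum to one (the 1-norm), which is the normalization under which the entries read as probabilities.

```python
import numpy as np

# Hypothetical 4-page web. Column j holds the click probabilities from page j
# to every page, so each column sums to 1 (column-stochastic matrix).
A = np.array([[0.0, 0.5, 0.0, 1/3],
              [1/3, 0.0, 0.0, 1/3],
              [1/3, 0.5, 0.0, 1/3],
              [1/3, 0.0, 1.0, 0.0]])

x = np.full(4, 0.25)        # initial rank/importance vector
for _ in range(100):
    x = A @ x               # next element of the sequence x, Ax, A^2 x, ...
    x = x / x.sum()         # keep the entries summing to 1

pagerank = x                # approximately satisfies A @ pagerank == pagerank
```

Convergence of this plain iteration relies on the chosen graph being strongly connected and aperiodic; for general link graphs, additional safeguards are needed that are outside the scope of this example.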


4.3 Cholesky Decomposition 


There are many ways to factorize special types of matrices that we en- 
counter often in machine learning. In the positive real numbers, we have 
the square-root operation that gives us a decomposition of the number 
into identical components, e.g., $9 = 3 \cdot 3$. For matrices, we need to be
careful that we compute a square-root-like operation on positive quanti- 
ties. For symmetric, positive definite matrices (see Section 3.2.3), we can 
choose from a number of square-root equivalent operations. The Cholesky 
decomposition/Cholesky factorization provides a square-root equivalent op- 
eration on symmetric, positive definite matrices that is useful in practice. 


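Theorem 4.18 (Cholesky Decomposition). A symmetric, positive definite
matrix $A$ can be factorized into a product $A = L L^\top$, where $L$ is a
lower-triangular matrix with positive diagonal elements:

$$\begin{bmatrix} a_{11} & \cdots & a_{1n} \\ \vdots & \ddots & \vdots \\ a_{n1} & \cdots & a_{nn} \end{bmatrix} = \begin{bmatrix} l_{11} & \cdots & 0 \\ \vdots & \ddots & \vdots \\ l_{n1} & \cdots & l_{nn} \end{bmatrix} \begin{bmatrix} l_{11} & \cdots & l_{n1} \\ \vdots & \ddots & \vdots \\ 0 & \cdots & l_{nn} \end{bmatrix}. \tag{4.44}$$

$L$ is called the Cholesky factor of $A$, and $L$ is unique.
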
Example 4.10 (Cholesky Factorization) 


Consider a symmetric, positive definite matrix $A \in \mathbb{R}^{3 \times 3}$.
We are interested in finding its Cholesky factorization $A = L L^\top$, i.e.,

$$A = \begin{bmatrix} a_{11} & a_{21} & a_{31} \\ a_{21} & a_{22} & a_{32} \\ a_{31} & a_{32} & a_{33} \end{bmatrix} = L L^\top = \begin{bmatrix} l_{11} & 0 & 0 \\ l_{21} & l_{22} & 0 \\ l_{31} & l_{32} & l_{33} \end{bmatrix} \begin{bmatrix} l_{11} & l_{21} & l_{31} \\ 0 & l_{22} & l_{32} \\ 0 & 0 & l_{33} \end{bmatrix}. \tag{4.45}$$

Multiplying out the right-hand side yields

$$A = \begin{bmatrix} l_{11}^2 & l_{21} l_{11} & l_{31} l_{11} \\ l_{21} l_{11} & l_{21}^2 + l_{22}^2 & l_{31} l_{21} + l_{32} l_{22} \\ l_{31} l_{11} & l_{31} l_{21} + l_{32} l_{22} & l_{31}^2 + l_{32}^2 + l_{33}^2 \end{bmatrix}. \tag{4.46}$$

Comparing the left-hand side of (4.45) and the right-hand side of (4.46)
shows that there is a simple pattern in the diagonal elements $l_{ii}$:

$$l_{11} = \sqrt{a_{11}}, \quad l_{22} = \sqrt{a_{22} - l_{21}^2}, \quad l_{33} = \sqrt{a_{33} - (l_{31}^2 + l_{32}^2)}. \tag{4.47}$$

Similarly for the elements below the diagonal ($l_{ij}$, where $i > j$), there is
also a repeating pattern:

$$l_{21} = \frac{1}{l_{11}} a_{21}, \quad l_{31} = \frac{1}{l_{11}} a_{31}, \quad l_{32} = \frac{1}{l_{22}} (a_{32} - l_{31} l_{21}). \tag{4.48}$$

Thus, we constructed the Cholesky decomposition for any symmetric, positive
definite $3 \times 3$ matrix. The key realization is that we can backward
calculate what the components $l_{ij}$ for the $L$ should be, given the values
$a_{ij}$ for $A$ and previously computed values of $l_{ij}$.
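The pattern of the $3 \times 3$ Cholesky example translates directly into code. The following sketch implements only this $3 \times 3$ case; the function name `cholesky_3x3` and the test matrix are illustrative choices, and the result is compared against NumPy's general routine `np.linalg.cholesky`.

```python
import numpy as np

def cholesky_3x3(A):
    """Cholesky factor of a symmetric, positive definite 3x3 matrix,
    computed entry by entry from the diagonal and below-diagonal patterns."""
    L = np.zeros((3, 3))
    L[0, 0] = np.sqrt(A[0, 0])
    L[1, 0] = A[1, 0] / L[0, 0]
    L[2, 0] = A[2, 0] / L[0, 0]
    L[1, 1] = np.sqrt(A[1, 1] - L[1, 0] ** 2)
    L[2, 1] = (A[2, 1] - L[2, 0] * L[1, 0]) / L[1, 1]
    L[2, 2] = np.sqrt(A[2, 2] - (L[2, 0] ** 2 + L[2, 1] ** 2))
    return L

# Illustrative symmetric, positive definite matrix.
A = np.array([[4.0, 2.0, 2.0],
              [2.0, 5.0, 3.0],
              [2.0, 3.0, 6.0]])
L = cholesky_3x3(A)
```

Each entry of `L` depends only on entries of `A` and on previously computed entries of `L`, which is why the order of the assignments matters.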


The Cholesky decomposition is an important tool for the numerical 
computations underlying machine learning. Here, symmetric positive def- 
inite matrices require frequent manipulation, e.g., the covariance matrix 
of a multivariate Gaussian variable (see Section 6.5) is symmetric, positive 
definite. The Cholesky factorization of this covariance matrix allows us to 
generate samples from a Gaussian distribution. It also allows us to perform 
a linear transformation of random variables, which is heavily exploited 
when computing gradients in deep stochastic models, such as the varia- 
tional auto-encoder (Jimenez Rezende et al., 2014; Kingma and Welling, 
2014). The Cholesky decomposition also allows us to compute determinants
very efficiently. Given the Cholesky decomposition $A = L L^\top$, we
know that $\det(A) = \det(L) \det(L^\top) = \det(L)^2$. Since $L$ is a
triangular matrix, the determinant is simply the product of its diagonal
entries so that $\det(A) = \prod_i l_{ii}^2$. Thus, many numerical software
packages use the Cholesky decomposition to make computations more efficient.
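Two of the uses mentioned in this paragraph, sampling from a Gaussian and computing determinants, can be sketched with NumPy. The mean `mu`, the covariance `Sigma`, and the number of samples are hypothetical values chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(3)
mu = np.array([1.0, -2.0])
Sigma = np.array([[2.0, 0.6],
                  [0.6, 1.0]])       # symmetric, positive definite covariance

L = np.linalg.cholesky(Sigma)        # Sigma = L L^T, L lower triangular

# If z ~ N(0, I), then mu + L z ~ N(mu, L L^T) = N(mu, Sigma).
z = rng.standard_normal((2, 100_000))
samples = mu[:, None] + L @ z        # one sample per column

# det(Sigma) = prod_i l_ii^2; the log form avoids overflow for large matrices.
det_from_chol = np.prod(np.diag(L)) ** 2
logdet_from_chol = 2.0 * np.sum(np.log(np.diag(L)))
```

The empirical covariance `np.cov(samples)` should be close to `Sigma`, with a deviation that shrinks as the number of samples grows.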


4.4 Eigendecomposition and Diagonalization 


A diagonal matrix is a matrix that has value zero on all off-diagonal
elements, i.e., they are of the form

$$D = \begin{bmatrix} c_1 & \cdots & 0 \\ \vdots & \ddots & \vdots \\ 0 & \cdots & c_n \end{bmatrix}. \tag{4.49}$$

They allow fast computation of determinants, powers, and inverses. The
determinant is the product of its diagonal entries, a matrix power $D^k$ is
given by each diagonal element raised to the power $k$, and the inverse
$D^{-1}$ is the reciprocal of its diagonal elements if all of them are nonzero.
In this section, we will discuss how to transform matrices into diagonal
form. This is an important application of the basis change we discussed in
Section 2.7.2 and eigenvalues from Section 4.2.

Recall that two matrices $A, D$ are similar (Definition 2.22) if there
exists an invertible matrix $P$, such that $D = P^{-1} A P$. More specifically,
we will look at matrices $A$ that are similar to diagonal matrices $D$ that
contain the eigenvalues of $A$ on the diagonal.


Definition 4.19 (Diagonalizable). A matrix $A \in \mathbb{R}^{n \times n}$ is
diagonalizable if it is similar to a diagonal matrix, i.e., if there exists an
invertible matrix $P \in \mathbb{R}^{n \times n}$ such that $D = P^{-1} A P$.


In the following, we will see that diagonalizing a matrix
$A \in \mathbb{R}^{n \times n}$ is a way of expressing the same linear mapping
but in another basis (see Section 2.6.1), which will turn out to be a basis
that consists of the eigenvectors of $A$.

Let $A \in \mathbb{R}^{n \times n}$, let $\lambda_1, \ldots, \lambda_n$ be a set of
scalars, and let $p_1, \ldots, p_n$ be a set of vectors in $\mathbb{R}^n$. We
define $P := [p_1, \ldots, p_n]$ and let $D \in \mathbb{R}^{n \times n}$ be a
diagonal matrix with diagonal entries $\lambda_1, \ldots, \lambda_n$. Then we can
show that

$$AP = PD \tag{4.50}$$

if and only if $\lambda_1, \ldots, \lambda_n$ are the eigenvalues of $A$ and
$p_1, \ldots, p_n$ are corresponding eigenvectors of $A$.

We can see that this statement holds because

$$AP = A[p_1, \ldots, p_n] = [A p_1, \ldots, A p_n], \tag{4.51}$$

$$PD = [p_1, \ldots, p_n] \begin{bmatrix} \lambda_1 & & 0 \\ & \ddots & \\ 0 & & \lambda_n \end{bmatrix} = [\lambda_1 p_1, \ldots, \lambda_n p_n]. \tag{4.52}$$

Thus, (4.50) implies that

$$A p_1 = \lambda_1 p_1 \tag{4.53}$$
$$\vdots$$
$$A p_n = \lambda_n p_n. \tag{4.54}$$

Therefore, the columns of $P$ must be eigenvectors of $A$.

Our definition of diagonalization requires that $P \in \mathbb{R}^{n \times n}$ is
invertible, i.e., $P$ has full rank (Theorem 4.3). This requires us to have $n$
linearly independent eigenvectors $p_1, \ldots, p_n$, i.e., the $p_i$ form a
basis of $\mathbb{R}^n$.


Theorem 4.20 (Eigendecomposition). A square matrix
$A \in \mathbb{R}^{n \times n}$ can be factored into

$$A = P D P^{-1}, \tag{4.55}$$

where $P \in \mathbb{R}^{n \times n}$ and $D$ is a diagonal matrix whose diagonal
entries are the eigenvalues of $A$, if and only if the eigenvectors of $A$ form
a basis of $\mathbb{R}^n$.





Theorem 4.20 implies that only non-defective matrices can be diagonal- 
ized and that the columns of P are the n eigenvectors of A. For symmetric 
matrices we can obtain even stronger outcomes for the eigenvalue decom- 
position. 


Theorem 4.21. A symmetric matrix $S \in \mathbb{R}^{n \times n}$ can always be
diagonalized.


Theorem 4.21 follows directly from the spectral theorem 4.15. Moreover,
the spectral theorem states that we can find an ONB of eigenvectors
of $\mathbb{R}^n$. This makes $P$ an orthogonal matrix so that
$D = P^\top A P$.


Remark. The Jordan normal form of a matrix offers a decomposition that
works for defective matrices (Lang, 1987) but is beyond the scope of this
book. ♢


Geometric Intuition for the Eigendecomposition 


We can interpret the eigendecomposition of a matrix as follows (see also
Figure 4.7): Let $A$ be the transformation matrix of a linear mapping with
respect to the standard basis. $P^{-1}$ performs a basis change from the
standard basis into the eigenbasis. This identifies the eigenvectors $p_i$
(blue and orange arrows in Figure 4.7) onto the standard basis vectors $e_i$.
Then, the diagonal $D$ scales the vectors along these axes by the eigenvalues
$\lambda_i$. Finally, $P$ transforms these scaled vectors back into the
standard/canonical coordinates yielding $\lambda_i p_i$.


Figure 4.7 Intuition behind the eigendecomposition as sequential
transformations. Top-left to bottom-left: $P^{-1}$ performs a basis change
(here drawn in $\mathbb{R}^2$ and depicted as a rotation-like operation) from
the standard basis into the eigenbasis. Bottom-left to bottom-right: $D$
performs a scaling along the remapped orthogonal eigenvectors, depicted here
by a circle being stretched to an ellipse. Bottom-right to top-right: $P$
undoes the basis change (depicted as a reverse rotation) and restores the
original coordinate frame.

Figure 4.7 visualizes the eigendecomposition of
$A = \begin{bmatrix} 5 & -2 \\ -2 & 5 \end{bmatrix}$ as a sequence of linear
transformations.


Example 4.11 (Eigendecomposition)
Let us compute the eigendecomposition of
$A = \frac{1}{2} \begin{bmatrix} 5 & -2 \\ -2 & 5 \end{bmatrix}$.

Step 1: Compute eigenvalues and eigenvectors. The characteristic
polynomial of $A$ is

$$\det(A - \lambda I) = \det\left(\begin{bmatrix} \frac{5}{2} - \lambda & -1 \\ -1 & \frac{5}{2} - \lambda \end{bmatrix}\right) \tag{4.56a}$$
$$= \left(\tfrac{5}{2} - \lambda\right)^2 - 1 = \lambda^2 - 5\lambda + \tfrac{21}{4} = \left(\lambda - \tfrac{7}{2}\right)\left(\lambda - \tfrac{3}{2}\right). \tag{4.56b}$$

Therefore, the eigenvalues of $A$ are $\lambda_1 = \frac{7}{2}$ and
$\lambda_2 = \frac{3}{2}$ (the roots of the characteristic polynomial), and the
associated (normalized) eigenvectors are obtained via

$$A p_1 = \tfrac{7}{2} p_1, \quad A p_2 = \tfrac{3}{2} p_2. \tag{4.57}$$

This yields

$$p_1 = \frac{1}{\sqrt{2}} \begin{bmatrix} 1 \\ -1 \end{bmatrix}, \quad p_2 = \frac{1}{\sqrt{2}} \begin{bmatrix} 1 \\ 1 \end{bmatrix}. \tag{4.58}$$

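Step 2: Check for existence. The eigenvectors $p_1, p_2$ form a basis of
$\mathbb{R}^2$. Therefore, $A$ can be diagonalized.

Step 3: Construct the matrix $P$ to diagonalize $A$. We collect the
eigenvectors of $A$ in $P$ so that

$$P = [p_1, p_2] = \frac{1}{\sqrt{2}} \begin{bmatrix} 1 & 1 \\ -1 & 1 \end{bmatrix}. \tag{4.59}$$

We then obtain

$$P^{-1} A P = \begin{bmatrix} \frac{7}{2} & 0 \\ 0 & \frac{3}{2} \end{bmatrix} = D. \tag{4.60}$$

Equivalently, we get (exploiting that $P^{-1} = P^\top$ since the eigenvectors
$p_1$ and $p_2$ in this example form an ONB)

$$\underbrace{\frac{1}{2} \begin{bmatrix} 5 & -2 \\ -2 & 5 \end{bmatrix}}_{A} = \underbrace{\frac{1}{\sqrt{2}} \begin{bmatrix} 1 & 1 \\ -1 & 1 \end{bmatrix}}_{P} \underbrace{\begin{bmatrix} \frac{7}{2} & 0 \\ 0 & \frac{3}{2} \end{bmatrix}}_{D} \underbrace{\frac{1}{\sqrt{2}} \begin{bmatrix} 1 & -1 \\ 1 & 1 \end{bmatrix}}_{P^{-1}}. \tag{4.61}$$
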
■ Diagonal matrices $D$ can efficiently be raised to a power. Therefore,
  we can find a matrix power for a matrix $A \in \mathbb{R}^{n \times n}$ via the
  eigenvalue decomposition (if it exists) so that

  $$A^k = (P D P^{-1})^k = P D^k P^{-1}. \tag{4.62}$$

  Computing $D^k$ is efficient because we apply this operation individually
  to any diagonal element.

■ Assume that the eigendecomposition $A = P D P^{-1}$ exists. Then,

  $$\det(A) = \det(P D P^{-1}) = \det(P) \det(D) \det(P^{-1}) \tag{4.63a}$$
  $$= \det(D) = \prod_i d_{ii} \tag{4.63b}$$

  allows for an efficient computation of the determinant of $A$.
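Both uses of the eigendecomposition, matrix powers and determinants, can be sketched with NumPy. The example below assumes the matrix $A = \frac{1}{2}\begin{bmatrix} 5 & -2 \\ -2 & 5 \end{bmatrix}$ with eigenvalues $\frac{7}{2}$ and $\frac{3}{2}$; the exponent `k` is an arbitrary illustrative choice.

```python
import numpy as np

A = 0.5 * np.array([[5.0, -2.0],
                    [-2.0, 5.0]])

# Columns of P are eigenvectors; eigenvalues holds the diagonal of D.
eigenvalues, P = np.linalg.eig(A)
k = 5

# A^k = P D^k P^{-1}: only the diagonal entries are raised to the power k.
A_pow = P @ np.diag(eigenvalues ** k) @ np.linalg.inv(P)

# det(A) = det(D) = product of the diagonal entries of D.
det_A = np.prod(eigenvalues)
```

The ordering and signs of the eigenvectors returned by `np.linalg.eig` may differ from a hand computation; the products $P D^k P^{-1}$ and $\det(D)$ are unaffected by this.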


The eigenvalue decomposition requires square matrices. It would be 
useful to perform a decomposition on general matrices. In the next sec- 
tion, we introduce a more general matrix decomposition technique, the 
singular value decomposition. 


4.5 Singular Value Decomposition 


The singular value decomposition (SVD) of a matrix is a central matrix 
decomposition method in linear algebra. It has been referred to as the 
“fundamental theorem of linear algebra” (Strang, 1993) because it can be 
applied to all matrices, not only to square matrices, and it always exists. 
Moreover, as we will explore in the following, the SVD of a matrix A, 
which represents a linear mapping $\Phi : V \to W$, quantifies the change
between the underlying geometry of these two vector spaces. We recom- 
mend the work by Kalman (1996) and Roy and Banerjee (2014) for a 
deeper overview of the mathematics of the SVD. 


Theorem 4.22 (SVD Theorem). Let $A \in \mathbb{R}^{m \times n}$ be a rectangular
matrix of rank $r \in [0, \min(m, n)]$. The SVD of $A$ is a decomposition of the
form

$$A = U \Sigma V^\top \tag{4.64}$$

with an orthogonal matrix $U \in \mathbb{R}^{m \times m}$ with column vectors
$u_i$, $i = 1, \ldots, m$, and an orthogonal matrix
$V \in \mathbb{R}^{n \times n}$ with column vectors $v_j$, $j = 1, \ldots, n$.
Moreover, $\Sigma$ is an $m \times n$ matrix with $\Sigma_{ii} = \sigma_i \geq 0$
and $\Sigma_{ij} = 0$, $i \neq j$.


The diagonal entries $\sigma_i$, $i = 1, \ldots, r$, of $\Sigma$ are called the
singular values, $u_i$ are called the left-singular vectors, and $v_j$ are
called the right-singular vectors. By convention, the singular values are
ordered, i.e., $\sigma_1 \geq \sigma_2 \geq \cdots \geq \sigma_r \geq 0$.

The singular value matrix $\Sigma$ is unique, but it requires some attention.
Observe that the $\Sigma \in \mathbb{R}^{m \times n}$ is rectangular. In
particular, $\Sigma$ is of the same size as $A$. This means that $\Sigma$ has a
diagonal submatrix that contains the singular values and needs additional
zero padding. Specifically, if $m > n$, then the matrix $\Sigma$ has diagonal
structure up to row $n$ and then consists of $0^\top$ row vectors from $n + 1$
to $m$ below so that

$$\Sigma = \begin{bmatrix} \sigma_1 & 0 & 0 \\ 0 & \ddots & 0 \\ 0 & 0 & \sigma_n \\ 0 & \cdots & 0 \\ \vdots & & \vdots \\ 0 & \cdots & 0 \end{bmatrix}. \tag{4.65}$$

If $m < n$, the matrix $\Sigma$ has a diagonal structure up to column $m$ and
columns that consist of $0$ from $m + 1$ to $n$:

$$\Sigma = \begin{bmatrix} \sigma_1 & 0 & 0 & 0 & \cdots & 0 \\ 0 & \ddots & 0 & \vdots & & \vdots \\ 0 & 0 & \sigma_m & 0 & \cdots & 0 \end{bmatrix}. \tag{4.66}$$

Remark. The SVD exists for any matrix $A \in \mathbb{R}^{m \times n}$. ♢

Figure 4.8 Intuition behind the SVD of a matrix
$A \in \mathbb{R}^{3 \times 2}$ as sequential transformations. Top-left to
bottom-left: $V^\top$ performs a basis change in $\mathbb{R}^2$. Bottom-left to
bottom-right: $\Sigma$ scales and maps from $\mathbb{R}^2$ to $\mathbb{R}^3$. The
ellipse in the bottom-right lives in $\mathbb{R}^3$. The third dimension is
orthogonal to the surface of the elliptical disk. Bottom-right to top-right:
$U$ performs a basis change within $\mathbb{R}^3$.
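The zero padding of the singular value matrix can be made concrete in code. In the sketch below, `sigma_matrix` is a hypothetical helper (not a NumPy function) that embeds the singular values returned by `np.linalg.svd` into an $m \times n$ matrix, for one tall and one wide random example.

```python
import numpy as np

def sigma_matrix(singular_values, m, n):
    """Embed the singular values on the diagonal of an m x n zero matrix."""
    Sigma = np.zeros((m, n))
    r = len(singular_values)              # r <= min(m, n)
    Sigma[:r, :r] = np.diag(singular_values)
    return Sigma

rng = np.random.default_rng(4)
for m, n in [(4, 2), (2, 4)]:             # tall (m > n) and wide (m < n) cases
    A = rng.standard_normal((m, n))
    # full_matrices=True returns U (m x m) and V^T (n x n).
    U, s, Vt = np.linalg.svd(A, full_matrices=True)
    Sigma = sigma_matrix(s, m, n)
    assert Sigma.shape == A.shape
    assert np.allclose(U @ Sigma @ Vt, A)  # A = U Sigma V^T
```

NumPy returns the singular values as a one-dimensional array sorted in descending order, so the padding has to be added by hand whenever the full matrix is needed.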


4.5.1 Geometric Intuitions for the SVD 


The SVD offers geometric intuitions to describe a transformation matrix
$A$. In the following, we will discuss the SVD as sequential linear
transformations performed on the bases. In Example 4.12, we will then apply
transformation matrices of the SVD to a set of vectors in $\mathbb{R}^2$, which
allows us to visualize the effect of each transformation more clearly.

The SVD of a matrix can be interpreted as a decomposition of a
corresponding linear mapping (recall Section 2.7.1)
$\Phi : \mathbb{R}^n \to \mathbb{R}^m$ into three operations; see Figure 4.8.
The SVD intuition follows superficially a similar structure to our
eigendecomposition intuition, see Figure 4.7: Broadly speaking, the SVD
performs a basis change via $V^\top$ followed by a scaling and augmentation (or
reduction) in dimensionality via the singular value matrix $\Sigma$. Finally,
it performs a second basis change via $U$. The SVD entails a number of
important details and caveats, which is why we will review our intuition in
more detail.

Assume we are given a transformation matrix of a linear mapping
$\Phi : \mathbb{R}^n \to \mathbb{R}^m$ with respect to the standard bases $B$
and $C$ of $\mathbb{R}^n$ and $\mathbb{R}^m$, respectively. Moreover, assume a
second basis $\tilde{B}$ of $\mathbb{R}^n$ and $\tilde{C}$ of $\mathbb{R}^m$. Then


1. The matrix $V$ performs a basis change in the domain $\mathbb{R}^n$ from
   $\tilde{B}$ (represented by the red and orange vectors $v_1$ and $v_2$ in
   the top-left of Figure 4.8) to the standard basis $B$.
   $V^\top = V^{-1}$ performs a basis change from $B$ to $\tilde{B}$. The red
   and orange vectors are now aligned with the canonical basis in the
   bottom-left of Figure 4.8.

2. Having changed the coordinate system to $\tilde{B}$, $\Sigma$ scales the new
   coordinates by the singular values $\sigma_i$ (and adds or deletes
   dimensions), i.e., $\Sigma$ is the transformation matrix of $\Phi$ with
   respect to $\tilde{B}$ and $\tilde{C}$, represented by the red and orange
   vectors being stretched and lying in the $e_1$-$e_2$ plane, which is now
   embedded in a third dimension in the bottom-right of Figure 4.8.

3. $U$ performs a basis change in the codomain $\mathbb{R}^m$ from $\tilde{C}$
   into the canonical basis of $\mathbb{R}^m$, represented by a rotation of the
   red and orange vectors out of the $e_1$-$e_2$ plane. This is shown in the
   top-right of Figure 4.8.


The SVD expresses a change of basis in both the domain and codomain.
This is in contrast with the eigendecomposition that operates within the
same vector space, where the same basis change is applied and then undone.
What makes the SVD special is that these two different bases are
simultaneously linked by the singular value matrix $\Sigma$.


Example 4.12 (Vectors and the SVD) 

Consider a mapping of a square grid of vectors $X \in \mathbb{R}^2$ that fit in
a box of size $2 \times 2$ centered at the origin. Using the standard basis, we
map these vectors using

$$A = \begin{bmatrix} 1 & -0.8 \\ 0 & 1 \\ 1 & 0 \end{bmatrix} = U \Sigma V^\top \tag{4.67a}$$
$$= \begin{bmatrix} -0.79 & 0 & -0.62 \\ 0.38 & -0.78 & -0.49 \\ -0.48 & -0.62 & 0.62 \end{bmatrix} \begin{bmatrix} 1.62 & 0 \\ 0 & 1.0 \\ 0 & 0 \end{bmatrix} \begin{bmatrix} -0.78 & 0.62 \\ -0.62 & -0.78 \end{bmatrix}. \tag{4.67b}$$

We start with a set of vectors $X$ (colored dots; see top-left panel of
Figure 4.9) arranged in a grid. We then apply
$V^\top \in \mathbb{R}^{2 \times 2}$, which rotates $X$. The rotated vectors are
shown in the bottom-left panel of Figure 4.9. We now map these vectors using
the singular value matrix $\Sigma$ to the codomain $\mathbb{R}^3$ (see the
bottom-right panel in Figure 4.9). Note that all vectors lie in the
$x_1$-$x_2$ plane. The third coordinate is always 0. The vectors in the
$x_1$-$x_2$ plane have been stretched by the singular values.

The direct mapping of the vectors $X$ by $A$ to the codomain $\mathbb{R}^3$
equals the transformation of $X$ by $U \Sigma V^\top$, where $U$ performs a
rotation within the codomain $\mathbb{R}^3$ so that the mapped vectors are no
longer restricted to the $x_1$-$x_2$ plane; they still are on a plane as shown
in the top-right panel of Figure 4.9.

It is useful to revise basis changes (Section 2.7.2), orthogonal matrices
(Definition 3.8) and orthonormal bases (Section 3.5).

Figure 4.9 SVD and mapping of vectors (represented by discs). The panels
follow the same anti-clockwise structure of Figure 4.8.
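The three stages of this example can be traced numerically. The sketch below assumes the matrix $A$ with rows $(1, -0.8)$, $(0, 1)$, $(1, 0)$ and a coarse $5 \times 5$ grid of points; the factors returned by `np.linalg.svd` may differ in sign from the rounded values printed in the example, since singular vectors are only determined up to sign.

```python
import numpy as np

A = np.array([[1.0, -0.8],
              [0.0,  1.0],
              [1.0,  0.0]])

U, s, Vt = np.linalg.svd(A)    # full SVD: U is 3x3, s has 2 entries, Vt is 2x2
Sigma = np.zeros(A.shape)
Sigma[:2, :2] = np.diag(s)

# Square grid of points in a 2x2 box centered at the origin, one point per column.
g = np.linspace(-1.0, 1.0, 5)
X = np.array(np.meshgrid(g, g)).reshape(2, -1)

step1 = Vt @ X           # basis change in R^2
step2 = Sigma @ step1    # scaling by the singular values, embedding into R^3
step3 = U @ step2        # basis change in R^3
```

After the second step, the third coordinate of every point is zero, and after the third step the points coincide with the direct images `A @ X`.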










4.5.2 Construction of the SVD 


We will next discuss why the SVD exists and show how to compute it 
in detail. The SVD of a general matrix shares some similarities with the 
eigendecomposition of a square matrix. 


Remark. Compare the eigendecomposition of an SPD matrix

$$S = S^\top = P D P^\top \tag{4.68}$$

with the corresponding SVD

$$S = U \Sigma V^\top. \tag{4.69}$$

If we set

$$U = P = V, \quad D = \Sigma, \tag{4.70}$$

we see that the SVD of SPD matrices is their eigendecomposition. ♢


In the following, we will explore why Theorem 4.22 holds and how
the SVD is constructed. Computing the SVD of $A \in \mathbb{R}^{m \times n}$ is
equivalent to finding two sets of orthonormal bases
$U = (u_1, \ldots, u_m)$ and $V = (v_1, \ldots, v_n)$ of the codomain
$\mathbb{R}^m$ and the domain $\mathbb{R}^n$, respectively. From these ordered
bases, we will construct the matrices $U$ and $V$.

Our plan is to start with constructing the orthonormal set of
right-singular vectors $v_1, \ldots, v_n \in \mathbb{R}^n$. We then construct the
orthonormal set of left-singular vectors $u_1, \ldots, u_m \in \mathbb{R}^m$.
Thereafter, we will link the two and require that the orthogonality of the
$v_i$ is preserved under the transformation of $A$. This is important because
we know that the images $A v_i$ form a set of orthogonal vectors. We will then
normalize these images by scalar factors, which will turn out to be the
singular values.

Let us begin with constructing the right-singular vectors. The spectral
theorem (Theorem 4.15) tells us that the eigenvectors of a symmetric
matrix form an ONB, which also means it can be diagonalized. Moreover,
from Theorem 4.14 we can always construct a symmetric, positive
semidefinite matrix $A^\top A \in \mathbb{R}^{n \times n}$ from any rectangular
matrix $A \in \mathbb{R}^{m \times n}$. Thus, we can always diagonalize
$A^\top A$ and obtain

$$A^\top A = P D P^\top = P \begin{bmatrix} \lambda_1 & \cdots & 0 \\ \vdots & \ddots & \vdots \\ 0 & \cdots & \lambda_n \end{bmatrix} P^\top, \tag{4.71}$$

where $P$ is an orthogonal matrix, which is composed of the orthonormal
eigenbasis. The $\lambda_i \geq 0$ are the eigenvalues of $A^\top A$. Let us
assume the SVD of $A$ exists and inject (4.64) into (4.71). This yields

$$A^\top A = (U \Sigma V^\top)^\top (U \Sigma V^\top) = V \Sigma^\top U^\top U \Sigma V^\top, \tag{4.72}$$

where $U, V$ are orthogonal matrices. Therefore, with $U^\top U = I$ we
obtain

$$A^\top A = V \Sigma^\top \Sigma V^\top = V \begin{bmatrix} \sigma_1^2 & 0 & 0 \\ 0 & \ddots & 0 \\ 0 & 0 & \sigma_n^2 \end{bmatrix} V^\top. \tag{4.73}$$

Comparing now (4.71) and (4.73), we identify

$$V^\top = P^\top, \tag{4.74}$$
$$\sigma_i^2 = \lambda_i. \tag{4.75}$$

Therefore, the eigenvectors of $A^\top A$ that compose $P$ are the
right-singular vectors $V$ of $A$ (see (4.74)). The eigenvalues of $A^\top A$
are the squared singular values of $\Sigma$ (see (4.75)).
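The identification of the eigenpairs of $A^\top A$ with the right-singular vectors and squared singular values can be checked numerically. The sketch assumes a random $4 \times 3$ matrix, for which the singular values are distinct, so that the eigenvectors are unique up to sign.

```python
import numpy as np

rng = np.random.default_rng(5)
A = rng.standard_normal((4, 3))

# Eigendecomposition of the symmetric matrix A^T A (eigh sorts ascending).
lam, P = np.linalg.eigh(A.T @ A)
lam, P = lam[::-1], P[:, ::-1]      # reorder to descending eigenvalues

_, s, Vt = np.linalg.svd(A)         # singular values in descending order

# Each column p_i of P should match the right-singular vector v_i up to sign,
# i.e., |p_i^T v_i| = 1 for every i.
overlaps = np.abs(np.sum(P * Vt.T, axis=0))
```

Here `s ** 2` agrees with `lam`, and all entries of `overlaps` are one up to round-off.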

To obtain the left-singular vectors $U$, we follow a similar procedure.
We start by computing the SVD of the symmetric matrix
$A A^\top \in \mathbb{R}^{m \times m}$ (instead of the previous
$A^\top A \in \mathbb{R}^{n \times n}$). The SVD of $A$ yields

$$A A^\top = (U \Sigma V^\top)(U \Sigma V^\top)^\top = U \Sigma V^\top V \Sigma^\top U^\top \tag{4.76a}$$
$$= U \begin{bmatrix} \sigma_1^2 & 0 & 0 \\ 0 & \ddots & 0 \\ 0 & 0 & \sigma_m^2 \end{bmatrix} U^\top. \tag{4.76b}$$

The spectral theorem tells us that $A A^\top = S D S^\top$ can be diagonalized
and we can find an ONB of eigenvectors of $A A^\top$, which are collected in
$S$. The orthonormal eigenvectors of $A A^\top$ are the left-singular vectors
$U$ and form an orthonormal basis in the codomain of the SVD.

This leaves the question of the structure of the matrix $\Sigma$. Since
$A A^\top$ and $A^\top A$ have the same nonzero eigenvalues (see page 106), the
nonzero entries of the $\Sigma$ matrices in the SVD for both cases have to be
the same.

The last step is to link up all the parts we touched upon so far. We have
an orthonormal set of right-singular vectors in $V$. To finish the
construction of the SVD, we connect them with the orthonormal vectors $U$. To
reach this goal, we use the fact the images of the $v_i$ under $A$ have to be
orthogonal, too. We can show this by using the results from Section 3.4.
We require that the inner product between $A v_i$ and $A v_j$ must be 0 for
$i \neq j$. For any two orthogonal eigenvectors $v_i, v_j$, $i \neq j$, it
holds that

$$(A v_i)^\top (A v_j) = v_i^\top (A^\top A) v_j = v_i^\top (\lambda_j v_j) = \lambda_j v_i^\top v_j = 0. \tag{4.77}$$

For the case $m \geq r$, it holds that $\{A v_1, \ldots, A v_r\}$ is a basis of an
$r$-dimensional subspace of $\mathbb{R}^m$.

To complete the SVD construction, we need left-singular vectors that
are orthonormal: We normalize the images of the right-singular vectors
$A v_i$ and obtain

$$u_i := \frac{A v_i}{\|A v_i\|} = \frac{1}{\sqrt{\lambda_i}} A v_i = \frac{1}{\sigma_i} A v_i, \tag{4.78}$$

where the last equality was obtained from (4.75) and (4.76b), showing
us that the eigenvalues of $A A^\top$ are such that $\sigma_i^2 = \lambda_i$.

Therefore, the eigenvectors of $A^\top A$, which we know are the
right-singular vectors $v_i$, and their normalized images under $A$, the
left-singular vectors $u_i$, form two self-consistent ONBs that are connected
through the singular value matrix $\Sigma$.

Let us rearrange (4.78) to obtain the singular value equation 


AV; = Citi, er (4.79) 
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This equation closely resembles the eigenvalue equation (4.25), but the 
vectors on the left- and the right-hand sides are not the same. 

For $n < m$, (4.79) holds only for $i \leq n$, but (4.79) says nothing about
the $u_i$ for $i > n$. However, we know by construction that they are orthonormal.
Conversely, for $m < n$, (4.79) holds only for $i \leq m$. For $i > m$,
we have $Av_i = 0$ and we still know that the $v_i$ form an orthonormal set.
This means that the SVD also supplies an orthonormal basis of the kernel
(null space) of $A$, the set of vectors $x$ with $Ax = 0$ (see Section 2.7.3).

Concatenating the $v_i$ as the columns of $V$ and the $u_i$ as the columns of
$U$ yields

$$AV = U\Sigma, \tag{4.80}$$

where $\Sigma$ has the same dimensions as $A$ and a diagonal structure for rows
$1, \ldots, r$. Hence, right-multiplying with $V^\top$ yields $A = U\Sigma V^\top$, which is
the SVD of $A$.
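
The relation (4.80) and the remark about the kernel can be checked numerically. The following is a minimal sketch, assuming NumPy and a small hypothetical wide matrix chosen only for illustration; the variable names are not from the text.

```python
import numpy as np

# Hypothetical wide matrix (m = 2 < n = 3), so A has a nontrivial null space.
A = np.array([[1.0, 0.0, 1.0],
              [-2.0, 1.0, 0.0]])

U, s, Vt = np.linalg.svd(A)           # full SVD: U is 2x2, Vt is 3x3
V = Vt.T
Sigma = np.zeros(A.shape)
Sigma[:len(s), :len(s)] = np.diag(s)  # embed the singular values in an m x n matrix

# (4.80): A V = U Sigma, and hence A = U Sigma V^T.
av_equals_usigma = np.allclose(A @ V, U @ Sigma)
reconstructs = np.allclose(U @ Sigma @ Vt, A)

# Columns of V beyond the rank are mapped to zero, so they span the kernel of A.
rank = int(np.sum(s > 1e-12))
null_basis = V[:, rank:]
maps_to_zero = np.allclose(A @ null_basis, 0.0)
```

The tolerance `1e-12` used to decide the rank is an assumption of this sketch; in practice it should be scaled to the size of the singular values.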


Example 4.13 (Computing the SVD) 
Let us find the singular value decomposition of

$$A = \begin{bmatrix} 1 & 0 & 1 \\ -2 & 1 & 0 \end{bmatrix}. \tag{4.81}$$


The SVD requires us to compute the right-singular vectors $v_j$, the singular
values $\sigma_k$, and the left-singular vectors $u_i$.

Step 1: Right-singular vectors as the eigenbasis of $A^\top A$.

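We start by computing

$$A^\top A = \begin{bmatrix} 1 & -2 \\ 0 & 1 \\ 1 & 0 \end{bmatrix} \begin{bmatrix} 1 & 0 & 1 \\ -2 & 1 & 0 \end{bmatrix} = \begin{bmatrix} 5 & -2 & 1 \\ -2 & 1 & 0 \\ 1 & 0 & 1 \end{bmatrix}. \tag{4.82}$$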


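We compute the singular values and right-singular vectors $v_j$ through
the eigenvalue decomposition of $A^\top A$, which is given as

$$A^\top A = \begin{bmatrix} \frac{5}{\sqrt{30}} & 0 & \frac{-1}{\sqrt{6}} \\ \frac{-2}{\sqrt{30}} & \frac{1}{\sqrt{5}} & \frac{-2}{\sqrt{6}} \\ \frac{1}{\sqrt{30}} & \frac{2}{\sqrt{5}} & \frac{1}{\sqrt{6}} \end{bmatrix} \begin{bmatrix} 6 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 0 \end{bmatrix} \begin{bmatrix} \frac{5}{\sqrt{30}} & \frac{-2}{\sqrt{30}} & \frac{1}{\sqrt{30}} \\ 0 & \frac{1}{\sqrt{5}} & \frac{2}{\sqrt{5}} \\ \frac{-1}{\sqrt{6}} & \frac{-2}{\sqrt{6}} & \frac{1}{\sqrt{6}} \end{bmatrix} = PDP^\top, \tag{4.83}$$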


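and we obtain the right-singular vectors as the columns of $P$ so that

$$V = P = \begin{bmatrix} \frac{5}{\sqrt{30}} & 0 & \frac{-1}{\sqrt{6}} \\ \frac{-2}{\sqrt{30}} & \frac{1}{\sqrt{5}} & \frac{-2}{\sqrt{6}} \\ \frac{1}{\sqrt{30}} & \frac{2}{\sqrt{5}} & \frac{1}{\sqrt{6}} \end{bmatrix}. \tag{4.84}$$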


Step 2: Singular-value matrix.
As the singular values $\sigma_i$ are the square roots of the eigenvalues of
$A^\top A$ we obtain them straight from $D$. Since $\text{rk}(A) = 2$, there are only
two nonzero singular values: $\sigma_1 = \sqrt{6}$ and $\sigma_2 = 1$. The singular value
matrix must be the same size as $A$, and we obtain

$$\Sigma = \begin{bmatrix} \sqrt{6} & 0 & 0 \\ 0 & 1 & 0 \end{bmatrix}. \tag{4.85}$$


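Step 3: Left-singular vectors as the normalized image of the right-singular
vectors.

We find the left-singular vectors by computing the image of the right-singular
vectors under $A$ and normalizing them by dividing them by their
corresponding singular value. We obtain

$$u_1 = \frac{1}{\sigma_1} A v_1 = \frac{1}{\sqrt{6}} \begin{bmatrix} 1 & 0 & 1 \\ -2 & 1 & 0 \end{bmatrix} \begin{bmatrix} \frac{5}{\sqrt{30}} \\ \frac{-2}{\sqrt{30}} \\ \frac{1}{\sqrt{30}} \end{bmatrix} = \begin{bmatrix} \frac{1}{\sqrt{5}} \\ -\frac{2}{\sqrt{5}} \end{bmatrix}, \tag{4.86}$$

$$u_2 = \frac{1}{\sigma_2} A v_2 = \frac{1}{1} \begin{bmatrix} 1 & 0 & 1 \\ -2 & 1 & 0 \end{bmatrix} \begin{bmatrix} 0 \\ \frac{1}{\sqrt{5}} \\ \frac{2}{\sqrt{5}} \end{bmatrix} = \begin{bmatrix} \frac{2}{\sqrt{5}} \\ \frac{1}{\sqrt{5}} \end{bmatrix}, \tag{4.87}$$

$$U = [u_1, u_2] = \frac{1}{\sqrt{5}} \begin{bmatrix} 1 & 2 \\ -2 & 1 \end{bmatrix}. \tag{4.88}$$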

Note that on a computer the approach illustrated here has poor numerical
behavior, and the SVD of $A$ is normally computed without resorting to the
eigenvalue decomposition of $A^\top A$.
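
The three steps of Example 4.13 can be retraced in code and compared with a dedicated SVD routine. This is a sketch that assumes NumPy; `np.linalg.eigh` and `np.linalg.svd` are used here for illustration, and singular vectors are only determined up to sign, so the comparison is made on the singular values and on the reconstructed matrix.

```python
import numpy as np

A = np.array([[1.0, 0.0, 1.0],
              [-2.0, 1.0, 0.0]])

# Step 1: eigendecomposition of the symmetric matrix A^T A.
eigvals, P = np.linalg.eigh(A.T @ A)   # eigh returns eigenvalues in ascending order
order = np.argsort(eigvals)[::-1]      # reorder to descending
eigvals, P = eigvals[order], P[:, order]

# Step 2: singular values are the square roots of the nonzero eigenvalues.
sigma_from_eig = np.sqrt(np.clip(eigvals[:2], 0.0, None))

# Step 3: left-singular vectors as normalized images of the right-singular vectors.
U_from_eig = (A @ P[:, :2]) / sigma_from_eig

# Route normally used in practice: a dedicated SVD routine.
U, s, Vt = np.linalg.svd(A)

# Rebuild A from the hand-constructed factors.
Sigma = np.zeros(A.shape)
Sigma[:2, :2] = np.diag(sigma_from_eig)
A_rebuilt = U_from_eig @ Sigma @ P.T
```

Both routes agree on this small example; the numerical drawbacks of forming $A^\top A$ only show up for ill-conditioned matrices.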


4.5.3 Eigenvalue Decomposition vs. Singular Value Decomposition 


Let us consider the eigendecomposition $A = PDP^{-1}$ and the SVD $A = U\Sigma V^\top$
and review the core elements of the past sections.


- The SVD always exists for any matrix $A \in \mathbb{R}^{m \times n}$. The eigendecomposition is
  only defined for square matrices $A \in \mathbb{R}^{n \times n}$ and only exists if we can find a
  basis of eigenvectors of $\mathbb{R}^n$.

- The vectors in the eigendecomposition matrix $P$ are not necessarily
  orthogonal, i.e., the change of basis is not a simple rotation and scaling.
  On the other hand, the vectors in the matrices $U$ and $V$ in the SVD are
  orthonormal, so they do represent rotations (possibly combined with reflections).

- Both the eigendecomposition and the SVD are compositions of three
  linear mappings:
  1. Change of basis in the domain
  2. Independent scaling of each new basis vector and mapping from domain to codomain
  3. Change of basis in the codomain

  A key difference between the eigendecomposition and the SVD is that
  in the SVD, domain and codomain can be vector spaces of different
  dimensions.

- In the SVD, the left- and right-singular vector matrices $U$ and $V$ are
  generally not inverses of each other (they perform basis changes in different
  vector spaces). In the eigendecomposition, the basis change matrices
  $P$ and $P^{-1}$ are inverses of each other.

- In the SVD, the entries in the diagonal matrix $\Sigma$ are all real and nonnegative,
  which is not generally true for the diagonal matrix in the
  eigendecomposition.

- The SVD and the eigendecomposition are closely related through their
  projections:
  - The left-singular vectors of $A$ are eigenvectors of $AA^\top$.
  - The right-singular vectors of $A$ are eigenvectors of $A^\top A$.
  - The nonzero singular values of $A$ are the square roots of the nonzero
    eigenvalues of both $AA^\top$ and $A^\top A$.

- For symmetric matrices $A \in \mathbb{R}^{n \times n}$, the eigenvalue decomposition and
  the SVD are one and the same, which follows from the spectral theorem 4.15.

Figure 4.10 data (rows: Star Wars, Blade Runner, Amelie, Delicatessen; columns: Ali, Beatrix, Chandra):

$$\begin{bmatrix} 5 & 4 & 1 \\ 5 & 5 & 0 \\ 0 & 0 & 5 \\ 1 & 0 & 4 \end{bmatrix} = \begin{bmatrix} -0.6710 & 0.0236 & 0.4647 & -0.5774 \\ -0.7197 & 0.2054 & -0.4759 & 0.4619 \\ -0.0939 & -0.7705 & -0.5268 & -0.3464 \\ -0.1515 & -0.6030 & 0.5293 & -0.5774 \end{bmatrix} \begin{bmatrix} 9.6438 & 0 & 0 \\ 0 & 6.3639 & 0 \\ 0 & 0 & 0.7056 \\ 0 & 0 & 0 \end{bmatrix} \begin{bmatrix} -0.7367 & -0.6515 & -0.1811 \\ 0.0852 & 0.1762 & -0.9807 \\ 0.6708 & -0.7379 & -0.0743 \end{bmatrix}$$


Example 4.14 (Finding Structure in Movie Ratings and Consumers) 
Let us add a practical interpretation of the SVD by analyzing data on 
people and their preferred movies. Consider three viewers (Ali, Beatrix, 
Chandra) rating four different movies (Star Wars, Blade Runner, Amelie, 
Delicatessen). Their ratings are values between 0 (worst) and 5 (best) and 
encoded in a data matrix $A \in \mathbb{R}^{4 \times 3}$ as shown in Figure 4.10. Each row
represents a movie and each column a user. Thus, the column vectors of
movie ratings, one for each viewer, are $x_{\text{Ali}}$, $x_{\text{Beatrix}}$, $x_{\text{Chandra}}$.

Figure 4.10 Movie ratings of three people for four movies and its SVD decomposition.

These two “spaces” are only meaningfully spanned by the respective viewer and movie data if the data itself covers a sufficient diversity of viewers and movies.

Factoring $A$ using the SVD offers us a way to capture the relationships
of how people rate movies, and especially if there is a structure linking
which people like which movies. Applying the SVD to our data matrix $A$
makes a number of assumptions:


1. All viewers rate movies consistently using the same linear mapping. 

2. There are no errors or noise in the ratings. 

3. We interpret the left-singular vectors $u_i$ as stereotypical movies and
the right-singular vectors $v_j$ as stereotypical viewers.


We then make the assumption that any viewer’s specific movie preferences
can be expressed as a linear combination of the $v_j$. Similarly, any movie’s
like-ability can be expressed as a linear combination of the $u_i$. Therefore,
a vector in the domain of the SVD can be interpreted as a viewer in the
“space” of stereotypical viewers, and a vector in the codomain of the SVD
correspondingly as a movie in the “space” of stereotypical movies. Let us
inspect the SVD of our movie-user matrix. The first left-singular vector $u_1$
has large absolute values for the two science fiction movies and a large
first singular value (red shading in Figure 4.10). Thus, this groups a type
of users with a specific set of movies (science fiction theme). Similarly, the
first right-singular vector $v_1$ shows large absolute values for Ali and Beatrix, who
give high ratings to science fiction movies (green shading in Figure 4.10).
This suggests that $v_1$ reflects the notion of a science fiction lover.

Similarly, $u_2$ seems to capture a French art house film theme, and $v_2$
indicates that Chandra is close to an idealized lover of such movies. An
idealized science fiction lover is a purist and only loves science fiction
movies, so a science fiction lover $v_1$ gives a rating of zero to everything
but science-fiction-themed movies; this logic is implied by the diagonal substructure
of the singular value matrix $\Sigma$. A specific movie is therefore represented
by how it decomposes (linearly) into its stereotypical movies. Likewise, a
person would be represented by how they decompose (via linear combination)
into movie themes.
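
The reading of the singular vectors in this example can be reproduced with a few lines of code. The sketch below assumes NumPy and uses the rating matrix from Figure 4.10; because the sign of each singular vector is arbitrary, entries are ranked by absolute value. The helper variable names are illustrative only.

```python
import numpy as np

movies = ["Star Wars", "Blade Runner", "Amelie", "Delicatessen"]
viewers = ["Ali", "Beatrix", "Chandra"]
A = np.array([[5.0, 4.0, 1.0],
              [5.0, 5.0, 0.0],
              [0.0, 0.0, 5.0],
              [1.0, 0.0, 4.0]])

U, s, Vt = np.linalg.svd(A)

# Largest-magnitude entries of the first and second singular vectors.
top_movies_1 = [movies[i] for i in np.argsort(-np.abs(U[:, 0]))[:2]]
top_viewers_1 = [viewers[j] for j in np.argsort(-np.abs(Vt[0]))[:2]]
top_movies_2 = [movies[i] for i in np.argsort(-np.abs(U[:, 1]))[:2]]
top_viewer_2 = viewers[int(np.argmax(np.abs(Vt[1])))]
```

Under these assumptions, the first pair of singular vectors picks out the two science fiction movies together with Ali and Beatrix, and the second pair picks out the two French films together with Chandra.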


It is worth briefly discussing SVD terminology and conventions, as there
are different versions used in the literature. The mathematics remains invariant
to these differences, but the differences themselves can be confusing.


- For convenience in notation and abstraction, we use an SVD notation
  where the SVD is described as having two square left- and right-singular
  vector matrices, but a non-square singular value matrix. Our definition
  (4.64) for the SVD is sometimes called the full SVD.

- Some authors define the SVD a bit differently and focus on square singular
  matrices. Then, for $A \in \mathbb{R}^{m \times n}$ and $m \geq n$,

  $$\underset{m \times n}{A} = \underset{m \times n}{U} \; \underset{n \times n}{\Sigma} \; \underset{n \times n}{V^\top}. \tag{4.89}$$

  Sometimes this formulation is called the reduced SVD (e.g., Datta (2010))
  or the SVD (e.g., Press et al. (2007)). This alternative format changes
  merely how the matrices are constructed but leaves the mathematical
  structure of the SVD unchanged. The convenience of this alternative
  formulation is that $\Sigma$ is diagonal, as in the eigenvalue decomposition.

- In Section 4.6, we will learn about matrix approximation techniques
  using the SVD, which is also called the truncated SVD.

- It is possible to define the SVD of a rank-$r$ matrix $A$ so that $U$ is an
  $m \times r$ matrix, $\Sigma$ a diagonal $r \times r$ matrix, and $V^\top$ an $r \times n$ matrix.
  This construction is very similar to our definition, and ensures that the
  diagonal matrix $\Sigma$ has only nonzero entries along the diagonal. The
  main convenience of this alternative notation is that $\Sigma$ is diagonal, as
  in the eigenvalue decomposition.

- A restriction that the SVD for $A$ only applies to $m \times n$ matrices with
  $m \geq n$ is practically unnecessary. When $m < n$, the SVD decomposition
  will yield $\Sigma$ with more zero columns than rows and, consequently, the
  singular values $\sigma_{m+1}, \ldots, \sigma_n$ are 0.
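
The difference between the full and the reduced convention is mostly a matter of matrix shapes. As an illustration, and assuming NumPy, the `full_matrices` argument of `np.linalg.svd` switches between the two; the random test matrix is hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((5, 3))       # hypothetical tall matrix, m = 5 > n = 3

# Full SVD: U is m x m, V^T is n x n, and Sigma must be padded to m x n.
U_full, s, Vt = np.linalg.svd(A, full_matrices=True)
Sigma_full = np.zeros((5, 3))
Sigma_full[:3, :3] = np.diag(s)

# Reduced SVD: U is m x n and Sigma is a square n x n diagonal matrix.
U_red, s_red, Vt_red = np.linalg.svd(A, full_matrices=False)

shapes = (U_full.shape, U_red.shape, Vt.shape)
same_matrix = np.allclose(U_full @ Sigma_full @ Vt,
                          U_red @ np.diag(s_red) @ Vt_red)
```

Both factorizations describe the same matrix; the reduced form simply drops the columns of $U$ that are multiplied by the zero rows of $\Sigma$.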


The SVD is used in a variety of applications in machine learning from 
least-squares problems in curve fitting to solving systems of linear equa- 
tions. These applications harness various important properties of the SVD, 
its relation to the rank of a matrix, and its ability to approximate matrices 
of a given rank with lower-rank matrices. Substituting a matrix with its
SVD often has the advantage of making calculations more robust to numerical
rounding errors. As we will explore in the next section, the SVD’s
ability to approximate matrices with “simpler” matrices in a principled 
manner opens up machine learning applications ranging from dimension- 
ality reduction and topic modeling to data compression and clustering. 


4.6 Matrix Approximation 


We considered the SVD as a way to factorize $A = U\Sigma V^\top \in \mathbb{R}^{m \times n}$ into
the product of three matrices, where $U \in \mathbb{R}^{m \times m}$ and $V \in \mathbb{R}^{n \times n}$ are
orthogonal and $\Sigma$ contains the singular values on its main diagonal. Instead
of doing the full SVD factorization, we will now investigate how the SVD
allows us to represent a matrix $A$ as a sum of simpler (low-rank) matrices
$A_i$, which lends itself to a matrix approximation scheme that is cheaper
to compute than the full SVD.

We construct a rank-1 matrix $A_i \in \mathbb{R}^{m \times n}$ as

$$A_i := u_i v_i^\top, \tag{4.90}$$

which is formed by the outer product of the $i$th orthogonal column vector
of $U$ and $V$. Figure 4.11 shows an image of Stonehenge, which can be
represented by a matrix $A \in \mathbb{R}^{1432 \times 1910}$, and some outer products $A_i$, as
defined in (4.90).

Figure 4.11 Image processing with the SVD. (a) The original grayscale image is a 1,432 x 1,910 matrix of values between 0 (black) and 1 (white). (b)-(f) Rank-1 matrices $A_1, \ldots, A_5$ and their corresponding singular values $\sigma_1, \ldots, \sigma_5$. The grid-like structure of each rank-1 matrix is imposed by the outer product of the left- and right-singular vectors.

(b) $A_1$, $\sigma_1 \approx 228{,}052$.
(c) $A_2$, $\sigma_2 \approx 40{,}647$.
(d) $A_3$, $\sigma_3 \approx 26{,}125$.
(e) $A_4$, $\sigma_4 \approx 20{,}232$.
(f) $A_5$, $\sigma_5 \approx 15{,}436$.


A matrix $A \in \mathbb{R}^{m \times n}$ of rank $r$ can be written as a sum of rank-1 matrices
$A_i$ so that

$$A = \sum_{i=1}^{r} \sigma_i u_i v_i^\top = \sum_{i=1}^{r} \sigma_i A_i, \tag{4.91}$$

where the outer-product matrices $A_i$ are weighted by the $i$th singular
value $\sigma_i$. We can see why (4.91) holds: The diagonal structure of the
singular value matrix $\Sigma$ multiplies only matching left- and right-singular
vectors $u_i v_i^\top$ and scales them by the corresponding singular value $\sigma_i$. All
terms $\Sigma_{ij} u_i v_j^\top$ vanish for $i \neq j$ because $\Sigma$ is a diagonal matrix. Any terms
$i > r$ vanish because the corresponding singular values are 0.
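
The decomposition (4.91) into weighted outer products can be verified directly. The following sketch assumes NumPy and a hypothetical random matrix; it is meant as a numerical illustration rather than a proof.

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((4, 3))
U, s, Vt = np.linalg.svd(A, full_matrices=False)

# Each A_i = u_i v_i^T is an outer product and therefore has rank 1.
rank_one_terms = [np.outer(U[:, i], Vt[i]) for i in range(len(s))]
ranks = [int(np.linalg.matrix_rank(A_i)) for A_i in rank_one_terms]

# Weighting the A_i by sigma_i and summing recovers A, as in (4.91).
A_sum = sum(sigma * A_i for sigma, A_i in zip(s, rank_one_terms))
```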

In (4.90), we introduced rank-1 matrices $A_i$. We summed up the $r$ individual
rank-1 matrices to obtain a rank-$r$ matrix $A$; see (4.91). If the
sum does not run over all matrices $A_i$, $i = 1, \ldots, r$, but only up to an
intermediate value $k < r$, we obtain a rank-$k$ approximation

$$\widehat{A}(k) = \sum_{i=1}^{k} \sigma_i u_i v_i^\top = \sum_{i=1}^{k} \sigma_i A_i \tag{4.92}$$

of $A$ with $\text{rk}(\widehat{A}(k)) = k$. Figure 4.12 shows low-rank approximations
$\widehat{A}(k)$ of an original image $A$ of Stonehenge. The shape of the rocks becomes
increasingly visible and clearly recognizable in the rank-5 approximation.
While the original image requires $1{,}432 \cdot 1{,}910 = 2{,}735{,}120$
numbers, the rank-5 approximation requires us only to store the five singular
values and the five left- and right-singular vectors (1,432- and 1,910-dimensional
each) for a total of $5 \cdot (1{,}432 + 1{,}910 + 1) = 16{,}715$ numbers,
just above 0.6% of the original.
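
A rank-$k$ approximation as in (4.92) and the storage count from the Stonehenge example can be sketched as follows. The code assumes NumPy; the function names `rank_k_approximation` and `storage_numbers` are hypothetical helpers introduced only for this illustration, and a small random matrix stands in for the image.

```python
import numpy as np

def rank_k_approximation(A, k):
    """Return the sum of the first k terms sigma_i * u_i v_i^T of the SVD of A."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    return (U[:, :k] * s[:k]) @ Vt[:k]

def storage_numbers(m, n, k):
    """Numbers needed for k singular values plus k left and k right singular vectors."""
    return k * (m + n + 1)

rng = np.random.default_rng(2)
A = rng.standard_normal((6, 5))        # hypothetical stand-in for an image
A_2 = rank_k_approximation(A, 2)

full_size = 1432 * 1910                # entries of the original image matrix
compressed = storage_numbers(1432, 1910, 5)
ratio = compressed / full_size
```

Scaling the columns of `U[:, :k]` by `s[:k]` before the matrix product avoids building the diagonal matrix explicitly; this is a design choice of the sketch, not something prescribed by the text.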

To measure the difference (error) between $A$ and its rank-$k$ approximation
$\widehat{A}(k)$, we need the notion of a norm. In Section 3.1, we already used
norms on vectors that measure the length of a vector. By analogy we can
also define norms on matrices.


Definition 4.23 (Spectral Norm of a Matrix). For $x \in \mathbb{R}^n \setminus \{0\}$, the spectral
norm of a matrix $A \in \mathbb{R}^{m \times n}$ is defined as

$$\|A\|_2 := \max_{x} \frac{\|Ax\|_2}{\|x\|_2}. \tag{4.93}$$

We introduce the notation of a subscript in the matrix norm (left-hand
side), similar to the Euclidean norm for vectors (right-hand side), which
has subscript 2. The spectral norm (4.93) determines how long any vector
$x$ can at most become when multiplied by $A$.


Theorem 4.24. The spectral norm of $A$ is its largest singular value $\sigma_1$.
We leave the proof of this theorem as an exercise.
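
Theorem 4.24 can be illustrated numerically, without replacing the proof. The sketch below assumes NumPy, where `np.linalg.norm(A, 2)` computes the spectral norm of a 2-D array; the random matrix and the random test directions are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(3)
A = rng.standard_normal((4, 3))
U, s, Vt = np.linalg.svd(A)

# ord=2 asks NumPy for the spectral norm of a 2-D array.
spectral = np.linalg.norm(A, 2)

# The maximum stretch in (4.93) is attained by the first right-singular vector v_1.
v1 = Vt[0]
stretch_v1 = np.linalg.norm(A @ v1) / np.linalg.norm(v1)

# Random directions never exceed sigma_1 (up to rounding).
X = rng.standard_normal((3, 1000))
stretches = np.linalg.norm(A @ X, axis=0) / np.linalg.norm(X, axis=0)
```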


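Theorem 4.25 (Eckart-Young Theorem (Eckart and Young, 1936)). Consider
a matrix $A \in \mathbb{R}^{m \times n}$ of rank $r$ and let $B \in \mathbb{R}^{m \times n}$ be a matrix of rank
$k$. For any $k \leq r$ with $\widehat{A}(k) = \sum_{i=1}^{k} \sigma_i u_i v_i^\top$ it holds that

$$\widehat{A}(k) = \operatorname{argmin}_{\text{rk}(B) = k} \|A - B\|_2, \tag{4.94}$$

$$\|A - \widehat{A}(k)\|_2 = \sigma_{k+1}. \tag{4.95}$$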


The Eckart-Young theorem states explicitly how much error we intro- 
duce by approximating A using a rank-k approximation. We can inter- 
pret the rank-k approximation obtained with the SVD as a projection of 
the full-rank matrix A onto a lower-dimensional space of rank-at-most-k 
matrices. Of all possible projections, the SVD minimizes the error (with 
respect to the spectral norm) between A and any rank-k approximation. 

Figure 4.12 Image reconstruction with the SVD. (a) Original image. (b)-(f) Image reconstruction using the low-rank approximation of the SVD, where the rank-$k$ approximation is given by $\widehat{A}(k) = \sum_{i=1}^{k} \sigma_i A_i$.

We can retrace some of the steps to understand why (4.95) should hold.
We observe that the difference $A - \widehat{A}(k)$ is a matrix containing
the sum of the remaining rank-1 matrices

$$A - \widehat{A}(k) = \sum_{i=k+1}^{r} \sigma_i u_i v_i^\top. \tag{4.96}$$

By Theorem 4.24, we immediately obtain $\sigma_{k+1}$ as the spectral norm of the
difference matrix. Let us have a closer look at (4.94). If we assume that
there is another matrix $B$ with $\text{rk}(B) \leq k$, such that

$$\|A - B\|_2 < \|A - \widehat{A}(k)\|_2, \tag{4.97}$$

then there exists an at least $(n - k)$-dimensional null space $Z \subseteq \mathbb{R}^n$, such
that $x \in Z$ implies that $Bx = 0$. Then it follows that

$$\|Ax\|_2 = \|(A - B)x\|_2, \tag{4.98}$$

and by using a version of the Cauchy-Schwarz inequality (3.17) that encompasses
norms of matrices, we obtain

$$\|Ax\|_2 \leq \|A - B\|_2 \|x\|_2 < \sigma_{k+1} \|x\|_2. \tag{4.99}$$

However, there exists a $(k + 1)$-dimensional subspace where $\|Ax\|_2 \geq
\sigma_{k+1} \|x\|_2$, which is spanned by the right-singular vectors $v_j$, $j \leq k + 1$ of
$A$. Adding up dimensions of these two spaces yields a number greater than
$n$, so there must be a nonzero vector in both spaces. This is a contradiction
of the rank-nullity theorem (Theorem 2.24) in Section 2.7.3.
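
The two statements of the Eckart-Young theorem can be spot-checked on a small example. This is only a sketch, assuming NumPy and a hypothetical random matrix: comparing against a few hundred random rank-$k$ matrices illustrates (4.94) but of course does not prove it.

```python
import numpy as np

rng = np.random.default_rng(4)
A = rng.standard_normal((6, 5))
U, s, Vt = np.linalg.svd(A, full_matrices=False)

k = 2
A_k = (U[:, :k] * s[:k]) @ Vt[:k]
error_svd = np.linalg.norm(A - A_k, 2)   # (4.95): should equal sigma_{k+1}, i.e. s[k]

# Spot check of (4.94): random matrices B = L R of rank at most k.
errors_random = []
for _ in range(200):
    B = rng.standard_normal((6, k)) @ rng.standard_normal((k, 5))
    errors_random.append(np.linalg.norm(A - B, 2))
```

Note that `s[k]` is $\sigma_{k+1}$ because NumPy arrays are indexed from zero.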

The Eckart-Young theorem implies that we can use SVD to reduce a
rank-$r$ matrix $A$ to a rank-$k$ matrix $\widehat{A}$ in a principled, optimal (in the
spectral norm sense) manner. We can interpret the approximation of $A$ by
a rank-k matrix as a form of lossy compression. Therefore, the low-rank 
approximation of a matrix appears in many machine learning applications, 
e.g., image processing, noise filtering, and regularization of ill-posed prob- 
lems. Furthermore, it plays a key role in dimensionality reduction and 
principal component analysis, as we will see in Chapter 10. 


Example 4.15 (Finding Structure in Movie Ratings and Consumers 
(continued)) 

Coming back to our movie-rating example, we can now apply the con- 
cept of low-rank approximations to approximate the original data matrix. 
Recall that our first singular value captures the notion of science fiction 
theme in movies and science fiction lovers. Thus, by using only the first 
singular value term in a rank-1 decomposition of the movie-rating matrix, 
we obtain the predicted ratings 


$$A_1 = u_1 v_1^\top = \begin{bmatrix} -0.6710 \\ -0.7197 \\ -0.0939 \\ -0.1515 \end{bmatrix} \begin{bmatrix} -0.7367 & -0.6515 & -0.1811 \end{bmatrix} \tag{4.100a}$$

$$= \begin{bmatrix} 0.4943 & 0.4372 & 0.1215 \\ 0.5302 & 0.4689 & 0.1303 \\ 0.0692 & 0.0612 & 0.0170 \\ 0.1116 & 0.0987 & 0.0274 \end{bmatrix}. \tag{4.100b}$$

This first rank-1 approximation $A_1$ is insightful: it tells us that Ali and
Beatrix like science fiction movies, such as Star Wars and Blade Runner
(entries have values > 0.4), but fails to capture the ratings of the other
movies by Chandra. This is not surprising, as Chandra’s type of movies is
not captured by the first singular value. The second singular value gives
us a better rank-1 approximation for those movie-theme lovers:

$$A_2 = u_2 v_2^\top = \begin{bmatrix} 0.0236 \\ 0.2054 \\ -0.7705 \\ -0.6030 \end{bmatrix} \begin{bmatrix} 0.0852 & 0.1762 & -0.9807 \end{bmatrix} \tag{4.101a}$$

$$= \begin{bmatrix} 0.0020 & 0.0042 & -0.0231 \\ 0.0175 & 0.0362 & -0.2014 \\ -0.0656 & -0.1358 & 0.7556 \\ -0.0514 & -0.1063 & 0.5914 \end{bmatrix}. \tag{4.101b}$$

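In this second rank-1 approximation $A_2$, we capture Chandra’s ratings
and movie types well, but not the science fiction movies. This leads us to
consider the rank-2 approximation $\widehat{A}(2)$, where we combine the first two
rank-1 approximations

$$\widehat{A}(2) = \sigma_1 A_1 + \sigma_2 A_2 = \begin{bmatrix} 4.7801 & 4.2419 & 1.0244 \\ 5.2252 & 4.7522 & -0.0250 \\ 0.2493 & -0.2743 & 4.9724 \\ 0.7495 & 0.2756 & 4.0278 \end{bmatrix}. \tag{4.102}$$

$\widehat{A}(2)$ is similar to the original movie ratings table

$$A = \begin{bmatrix} 5 & 4 & 1 \\ 5 & 5 & 0 \\ 0 & 0 & 5 \\ 1 & 0 & 4 \end{bmatrix}, \tag{4.103}$$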

and this suggests that we can ignore the contribution of A3. We can in- 
terpret this so that in the data table there is no evidence of a third movie- 
theme/movie-lovers category. This also means that the entire space of 
movie-themes/movie-lovers in our example is a two-dimensional space 
spanned by science fiction and French art house movies and lovers. 
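
The numbers in this example can be recomputed in a few lines. The sketch assumes NumPy; since each product $\sigma_i u_i v_i^\top$ does not depend on the sign convention of the singular vectors, the rank-1 terms should match the values above up to rounding.

```python
import numpy as np

A = np.array([[5.0, 4.0, 1.0],
              [5.0, 5.0, 0.0],
              [0.0, 0.0, 5.0],
              [1.0, 0.0, 4.0]])
U, s, Vt = np.linalg.svd(A, full_matrices=False)

A1 = np.outer(U[:, 0], Vt[0])            # science fiction term u_1 v_1^T
A2 = np.outer(U[:, 1], Vt[1])            # French art house term u_2 v_2^T
A_hat_2 = s[0] * A1 + s[1] * A2          # rank-2 approximation

# The part we ignore, sigma_3 * u_3 v_3^T, has spectral norm sigma_3.
residual = np.linalg.norm(A - A_hat_2, 2)
max_entry_error = np.abs(A - A_hat_2).max()
```

Here `residual` equals the third singular value, which is small compared with the first two; this is the quantitative reason the contribution of $A_3$ can be ignored.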


©2021 M. P. Deisenroth, A. A. Faisal, C. S. Ong. Published by Cambridge University Press (2020). 


4.7 Matrix Phylogeny

Figure 4.13 A functional phylogeny of matrices encountered in machine learning.


In Chapters 2 and 3, we covered the basics of linear algebra and analytic 
geometry. In this chapter, we looked at fundamental characteristics of ma- 
trices and linear mappings. Figure 4.13 depicts the phylogenetic tree of 
relationships between different types of matrices (black arrows indicating 
“is a subset of”) and the covered operations we can perform on them (in 
blue). We consider all real matrices $A \in \mathbb{R}^{n \times m}$. For non-square matrices
(where $n \neq m$), the SVD always exists, as we saw in this chapter. Focusing
on square matrices $A \in \mathbb{R}^{n \times n}$, the determinant informs us whether a
square matrix possesses an inverse matrix, i.e., whether it belongs to the 
class of regular, invertible matrices. If the square n x n matrix possesses n 
linearly independent eigenvectors, then the matrix is non-defective and an 
eigendecomposition exists (Theorem 4.12). We know that repeated eigen- 
values may result in defective matrices, which cannot be diagonalized. 

Non-singular and non-defective matrices are not the same. For example,
a rotation matrix will be invertible (determinant is nonzero) but not
diagonalizable in the real numbers (eigenvalues are not guaranteed to be
real numbers).

We dive further into the branch of non-defective square $n \times n$ matrices.
$A$ is normal if the condition $A^\top A = AA^\top$ holds. Moreover, if the more
restrictive condition holds that $A^\top A = AA^\top = I$, then $A$ is called orthogonal
(see Definition 3.8). The set of orthogonal matrices is a subset of
the regular (invertible) matrices and satisfies $A^\top = A^{-1}$.

Normal matrices have a frequently encountered subset, the symmetric
matrices $S \in \mathbb{R}^{n \times n}$, which satisfy $S = S^\top$. Symmetric matrices have only
real eigenvalues. A subset of the symmetric matrices consists of the positive
definite matrices $P$ that satisfy the condition of $x^\top P x > 0$ for all
$x \in \mathbb{R}^n \setminus \{0\}$. In this case, a unique Cholesky decomposition exists (Theorem 4.18).
Positive definite matrices have only positive eigenvalues and
are always invertible (i.e., have a nonzero determinant).

Another subset of symmetric matrices consists of the diagonal matrices
$D$. Diagonal matrices are closed under multiplication and addition, but do
not necessarily form a group (this is only the case if all diagonal entries
are nonzero so that the matrix is invertible). A special diagonal matrix is
the identity matrix $I$.
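
Several of the classes in this phylogeny can be tested numerically. The following is a rough sketch, assuming NumPy; the helper names and the tolerance are hypothetical choices, and positive definiteness is checked through the existence of a Cholesky factorization, which `np.linalg.cholesky` signals by raising `LinAlgError` when it fails.

```python
import numpy as np

def is_symmetric(A, tol=1e-10):
    return np.allclose(A, A.T, atol=tol)

def is_normal(A, tol=1e-10):
    return np.allclose(A.T @ A, A @ A.T, atol=tol)

def is_orthogonal(A, tol=1e-10):
    return np.allclose(A.T @ A, np.eye(A.shape[0]), atol=tol)

def is_positive_definite(A):
    """Symmetric and Cholesky-factorizable; NumPy raises LinAlgError otherwise."""
    if not is_symmetric(A):
        return False
    try:
        np.linalg.cholesky(A)
        return True
    except np.linalg.LinAlgError:
        return False

theta = np.pi / 4
rotation = np.array([[np.cos(theta), -np.sin(theta)],
                     [np.sin(theta),  np.cos(theta)]])
spd = np.array([[2.0, 1.0],
                [1.0, 2.0]])
shear = np.array([[1.0, 1.0],
                  [0.0, 1.0]])   # repeated eigenvalue 1 with a single eigenvector
```

The rotation matrix is orthogonal and invertible but has complex eigenvalues, while the shear matrix is invertible yet defective, which matches the distinction drawn above between non-singular and non-defective matrices.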


4.8 Further Reading 


Most of the content in this chapter establishes underlying mathematics 
and connects them to methods for studying mappings, many of which are 
at the heart of machine learning at the level of underpinning software so- 
lutions and building blocks for almost all machine learning theory. Matrix 
characterization using determinants, eigenspectra, and eigenspaces pro- 
vides fundamental features and conditions for categorizing and analyzing 
matrices. This extends to all forms of representations of data and map- 
pings involving data, as well as judging the numerical stability of compu- 
tational operations on such matrices (Press et al., 2007). 

Determinants are fundamental tools in order to invert matrices and 
compute eigenvalues “by hand”. However, for almost all but the smallest 
instances, numerical computation by Gaussian elimination outperforms 
determinants (Press et al., 2007). Determinants remain nevertheless a 
powerful theoretical concept, e.g., to gain intuition about the orientation 
of a basis based on the sign of the determinant. Eigenvectors can be used
to perform basis changes to transform data into the coordinates of meaningful
orthogonal feature vectors. Similarly, matrix decomposition methods,
such as the Cholesky decomposition, reappear often when we compute
or simulate random events (Rubinstein and Kroese, 2016). Therefore,
the Cholesky decomposition enables us to compute the reparametrization
trick where we want to perform continuous differentiation over random
variables, e.g., in variational autoencoders (Jimenez Rezende et al., 2014;
Kingma and Welling, 2014).

Eigendecomposition is fundamental in enabling us to extract meaningful
and interpretable information that characterizes linear mappings.
Therefore, the eigendecomposition underlies a general class of machine
learning algorithms called spectral methods that perform eigendecomposition
of a positive-definite kernel. These spectral decomposition methods
encompass classical approaches to statistical data analysis, such as the
following:


- Principal component analysis (PCA (Pearson, 1901), see also Chapter 10),
  in which a low-dimensional subspace, which explains most of the variability
  in the data, is sought.

- Fisher discriminant analysis, which aims to determine a separating hyperplane
  for data classification (Mika et al., 1999).

- Multidimensional scaling (MDS) (Carroll and Chang, 1970).


The computational efficiency of these methods typically comes from find- 
ing the best rank-k approximation to a symmetric, positive semidefinite 
matrix. More contemporary examples of spectral methods have different 
origins, but each of them requires the computation of the eigenvectors 
and eigenvalues of a positive-definite kernel, such as Isomap (Tenenbaum 
et al., 2000), Laplacian eigenmaps (Belkin and Niyogi, 2003), Hessian 
eigenmaps (Donoho and Grimes, 2003), and spectral clustering (Shi and 
Malik, 2000). The core computations of these are generally underpinned 
by low-rank matrix approximation techniques (Belabbas and Wolfe, 2009) 
as we encountered here via the SVD. 

The SVD allows us to discover some of the same kind of information as
the eigendecomposition. However, the SVD is more generally applicable
to non-square matrices and data tables. These matrix factorization methods
become relevant whenever we want to identify heterogeneity in data,
when we want to perform data compression by approximation, e.g., instead
of storing $n \times m$ values just storing $(n + m)k$ values, or when we want
to perform data pre-processing, e.g., to decorrelate predictor variables of
a design matrix (Ormoneit et al., 2001). The SVD operates on matrices,
which we can interpret as rectangular arrays with two indices (rows and
columns). Extensions of this matrix-like structure to higher-dimensional
arrays are called tensors. It turns out that the SVD is the special case of
a more general family of decompositions that operate on such tensors
(Kolda and Bader, 2009). SVD-like operations and low-rank approximations
on tensors are, for example, the Tucker decomposition (Tucker, 1966)
or the CP decomposition (Carroll and Chang, 1970).

The SVD low-rank approximation is frequently used in machine learn- 
ing for computational efficiency reasons. This is because it reduces the 
amount of memory and operations with nonzero multiplications we need 
to perform on potentially very large matrices of data (Trefethen and Bau III, 
1997). Moreover, low-rank approximations are used to operate on ma- 
trices that may contain missing values as well as for purposes of lossy 
compression and dimensionality reduction (Moonen and De Moor, 1995; 
Markovsky, 2011). 


Draft (2021-01-14) of “Mathematics for Machine Learning”. Feedback: https://mml-book.com.


Exercises


4.1 Compute the determinant using the Laplace expansion (using the first row)
and the Sarrus rule for


1 
A=} 2 
0 


N eW 
e OQ ot 


4.2 Compute the following determinant efficiently:


2 0 1 2 0 
2 -1 0 1 21 
0 T 2s ie 2 
—2 0 2 -1 2 
2 Or YOY. vil, sl 

4.3 Compute the eigenspaces of
a.
$$A = \begin{bmatrix} 1 & 0 \\ 1 & 1 \end{bmatrix}$$
b.
—2 2 
B = 
> 

4.4 Compute all eigenspaces of
0 -1 1 1 
A= oe 2 3 
2 —1 0 0 
L- =f eT 20 


4.5 Diagonalizability of a matrix is unrelated to its invertibility. Determine for
the following four matrices whether they are diagonalizable and/or invertible.


Satter! 


4.6 Compute the eigenspaces of the following transformation matrices. Are they
diagonalizable?

a. For
$$A = \begin{bmatrix} 2 & 3 & 0 \\ 1 & 4 & 3 \\ 0 & 0 & 1 \end{bmatrix}$$
b. For
$$A = \begin{bmatrix} 1 & 1 & 0 & 0 \\ 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 \end{bmatrix}$$

4.7 Are the following matrices diagonalizable? If yes, determine their diagonal
form and a basis with respect to which the transformation matrices are diagonal.
If no, give reasons why they are not diagonalizable.


a. 
0 1 
a= [is i 
b. 
ete 21: 
A= EY ul: 
1 1 1 
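c.
$$A = \begin{bmatrix} 5 & 4 & 2 & 1 \\ 0 & 1 & -1 & -1 \\ -1 & -1 & 3 & 0 \\ 1 & 1 & -1 & 2 \end{bmatrix}$$
d.
$$A = \begin{bmatrix} 5 & -6 & -6 \\ -1 & 4 & 2 \\ 3 & -6 & -4 \end{bmatrix}$$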
4.8 Find the SVD of the matrix 
3 2 2 
as i 3 2] 


4.9 Find the singular value decomposition of 


ishg d: 


4.10 Find the rank-1 approximation of 


Az|> 2 2 
2 3 -2 


4.11 Show that for any $A \in \mathbb{R}^{m \times n}$ the matrices $A^\top A$ and $AA^\top$ possess the
same nonzero eigenvalues.
4.12 Show that for $x \neq 0$ Theorem 4.24 holds, i.e., show that

$$\max_{x} \frac{\|Ax\|_2}{\|x\|_2} = \sigma_1,$$

where $\sigma_1$ is the largest singular value of $A \in \mathbb{R}^{m \times n}$.


Vector Calculus


Many algorithms in machine learning optimize an objective function with 
respect to a set of desired model parameters that control how well a model 
explains the data: Finding good parameters can be phrased as an opti- 
mization problem (see Sections 8.2 and 8.3). Examples include: (i) lin- 
ear regression (see Chapter 9), where we look at curve-fitting problems 
and optimize linear weight parameters to maximize the likelihood; (ii) 
neural-network auto-encoders for dimensionality reduction and data com- 
pression, where the parameters are the weights and biases of each layer, 
and where we minimize a reconstruction error by repeated application of 
the chain rule; and (iii) Gaussian mixture models (see Chapter 11) for 
modeling data distributions, where we optimize the location and shape 
parameters of each mixture component to maximize the likelihood of the 
model. Figure 5.1 illustrates some of these problems, which we typically 
solve by using optimization algorithms that exploit gradient information 
(Section 7.1). Figure 5.2 gives an overview of how concepts in this chap- 
ter are related and how they are connected to other chapters of the book. 


Central to this chapter is the concept of a function. A function $f$ is
a quantity that relates two quantities to each other. In this book, these
quantities are typically inputs $x \in \mathbb{R}^D$ and targets (function values) $f(x)$,
which we assume are real-valued if not stated otherwise. Here $\mathbb{R}^D$ is the
domain of $f$, and the function values $f(x)$ are the image/codomain of $f$.





Figure 5.1, panels:
(a) Regression problem (legend: training data, MLE): Find parameters, such that the curve explains the observations (crosses) well.
(b) Density estimation with a Gaussian mixture model: Find means and covariances, such that the data (dots) can be explained well.


This material is published by Cambridge University Press as Mathematics for Machine Learning by 
Marc Peter Deisenroth, A. Aldo Faisal, and Cheng Soon Ong (2020). This version is free to view 
and download for personal use only. Not for re-distribution, re-sale, or use in derivative works. 
© by M. P. Deisenroth, A. A. Faisal, and C. S. Ong, 2021. https://mml-book.com.








Figure 5.1 Vector calculus plays a central role in (a) regression (curve fitting) and (b) density estimation, i.e., modeling data distributions.

Figure 5.2 A mind map of the concepts introduced in this chapter, along with when they are used in other parts of the book.

Section 2.7.3 provides much more detailed discussion in the context of
linear functions. We often write

$$f : \mathbb{R}^D \to \mathbb{R} \tag{5.1a}$$
$$x \mapsto f(x) \tag{5.1b}$$

to specify a function, where (5.1a) specifies that $f$ is a mapping from
$\mathbb{R}^D$ to $\mathbb{R}$ and (5.1b) specifies the explicit assignment of an input $x$ to
a function value $f(x)$. A function $f$ assigns every input $x$ exactly one
function value $f(x)$.


Example 5.1 
Recall the dot product as a special case of an inner product (Section 3.2).
In the previous notation, the function $f(x) = x^\top x$, $x \in \mathbb{R}^2$, would be
specified as

$$f : \mathbb{R}^2 \to \mathbb{R} \tag{5.2a}$$
$$x \mapsto x_1^2 + x_2^2. \tag{5.2b}$$


In this chapter, we will discuss how to compute gradients of functions,
which is often essential to facilitate learning in machine learning models
since the gradient points in the direction of steepest ascent. Therefore,
vector calculus is one of the fundamental mathematical tools we need in
machine learning. Throughout this book, we assume that functions are
differentiable. With some additional technical definitions, which we do
not cover here, many of the approaches presented can be extended to
sub-differentials (functions that are continuous but not differentiable at
certain points). We will look at an extension to the case of functions with
constraints in Chapter 7.


5.1 Differentiation of Univariate Functions 


In the following, we briefly revisit differentiation of a univariate function, 
which may be familiar from high school mathematics. We start with the 
difference quotient of a univariate function $y = f(x)$, $x, y \in \mathbb{R}$, which we
will subsequently use to define derivatives. 


Definition 5.1 (Difference Quotient). The difference quotient

$$ \frac{\delta y}{\delta x} := \frac{f(x + \delta x) - f(x)}{\delta x} \tag{5.3} $$

computes the slope of the secant line through two points on the graph of
$f$. In Figure 5.3, these are the points with x-coordinates $x_0$ and
$x_0 + \delta x$.


The difference quotient can also be considered the average slope of $f$
between $x$ and $x + \delta x$ if we assume $f$ to be a linear function. In
the limit for $\delta x \to 0$, we obtain the tangent of $f$ at $x$, if $f$
is differentiable. The tangent is then the derivative of $f$ at $x$.


Definition 5.2 (Derivative). More formally, for $h > 0$ the derivative of $f$
at $x$ is defined as the limit

$$ \frac{df}{dx} := \lim_{h \to 0} \frac{f(x + h) - f(x)}{h} , \tag{5.4} $$

and the secant in Figure 5.3 becomes a tangent.


The derivative of f points in the direction of steepest ascent of f. 
Figure 5.3 The incline of the secant (blue) through $f(x_0)$ and
$f(x_0 + \delta x)$ is given by $\delta y / \delta x$.
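Example 5.2 (Derivative of a Polynomial)

We want to compute the derivative of $f(x) = x^n$, $n \in \mathbb{N}$. We may
already know that the answer will be $n x^{n-1}$, but we want to derive this
result using the definition of the derivative as the limit of the difference
quotient. Using the definition of the derivative in (5.4), we obtain

$$ \frac{df}{dx} = \lim_{h \to 0} \frac{f(x + h) - f(x)}{h} \tag{5.5a} $$
$$ = \lim_{h \to 0} \frac{(x + h)^n - x^n}{h} \tag{5.5b} $$
$$ = \lim_{h \to 0} \frac{\sum_{i=0}^{n} \binom{n}{i} x^{n-i} h^i - x^n}{h} . \tag{5.5c} $$

We see that $x^n = \binom{n}{0} x^{n-0} h^0$. By starting the sum at 1, the
$x^n$-term cancels, and we obtain

$$ \frac{df}{dx} = \lim_{h \to 0} \frac{\sum_{i=1}^{n} \binom{n}{i} x^{n-i} h^i}{h} \tag{5.6a} $$
$$ = \lim_{h \to 0} \sum_{i=1}^{n} \binom{n}{i} x^{n-i} h^{i-1} \tag{5.6b} $$
$$ = \lim_{h \to 0} \Bigg( \binom{n}{1} x^{n-1} + \underbrace{\sum_{i=2}^{n} \binom{n}{i} x^{n-i} h^{i-1}}_{\to 0 \text{ as } h \to 0} \Bigg) \tag{5.6c} $$
$$ = \frac{n!}{1!\,(n-1)!} x^{n-1} = n x^{n-1} . \tag{5.6d} $$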
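The limit in Example 5.2 can also be observed numerically. The following is a minimal sketch, not part of the original text: the helper name `difference_quotient` and the choices n = 5 and x = 1.5 are illustrative assumptions. It evaluates the difference quotient (5.3) for shrinking step sizes and compares it with $n x^{n-1}$.

```python
def difference_quotient(f, x, h):
    """Slope of the secant through (x, f(x)) and (x + h, f(x + h)), cf. (5.3)."""
    return (f(x + h) - f(x)) / h


n = 5                      # illustrative exponent (assumption)
x = 1.5                    # illustrative evaluation point (assumption)
exact = n * x ** (n - 1)   # derivative of x^n according to (5.6d)


def f(x):
    return x ** n


errors = []
for h in (1e-1, 1e-2, 1e-3, 1e-4):
    approx = difference_quotient(f, x, h)
    errors.append(abs(approx - exact))
    print(f"h = {h:.0e}: quotient = {approx:.6f}, exact = {exact:.6f}")
```

The gap between the quotient and the exact derivative shrinks roughly in proportion to h, matching the vanishing term in (5.6c).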


5.1.1 Taylor Series 


The Taylor series is a representation of a function $f$ as an infinite sum of
terms. These terms are determined using derivatives of $f$ evaluated at $x_0$.


Definition 5.3 (Taylor Polynomial). The Taylor polynomial of degree $n$ of
$f : \mathbb{R} \to \mathbb{R}$ at $x_0$ is defined as

$$ T_n(x) := \sum_{k=0}^{n} \frac{f^{(k)}(x_0)}{k!} (x - x_0)^k , \tag{5.7} $$

where $f^{(k)}(x_0)$ is the $k$th derivative of $f$ at $x_0$ (which we assume
exists) and $\frac{f^{(k)}(x_0)}{k!}$ are the coefficients of the polynomial.


Definition 5.4 (Taylor Series). For a smooth function $f \in C^\infty$,
$f : \mathbb{R} \to \mathbb{R}$, the Taylor series of $f$ at $x_0$ is defined as

$$ T_\infty(x) = \sum_{k=0}^{\infty} \frac{f^{(k)}(x_0)}{k!} (x - x_0)^k . \tag{5.8} $$

For $x_0 = 0$, we obtain the Maclaurin series as a special instance of the
Taylor series. If $f(x) = T_\infty(x)$, then $f$ is called analytic.


Remark. In general, a Taylor polynomial of degree $n$ is an approximation
of a function, which does not need to be a polynomial. The Taylor
polynomial is similar to $f$ in a neighborhood around $x_0$. However, a Taylor
polynomial of degree $n$ is an exact representation of a polynomial $f$ of
degree $k \leq n$ since all derivatives $f^{(i)}$, $i > k$ vanish. ♢


Example 5.3 (Taylor Polynomial)
We consider the polynomial

$$ f(x) = x^4 \tag{5.9} $$

and seek the Taylor polynomial $T_6$, evaluated at $x_0 = 1$. We start by
computing the coefficients $f^{(k)}(1)$ for $k = 0, \ldots, 6$:

$$ f(1) = 1 \tag{5.10} $$
$$ f'(1) = 4 \tag{5.11} $$
$$ f''(1) = 12 \tag{5.12} $$
$$ f^{(3)}(1) = 24 \tag{5.13} $$
$$ f^{(4)}(1) = 24 \tag{5.14} $$
$$ f^{(5)}(1) = 0 \tag{5.15} $$
$$ f^{(6)}(1) = 0 \tag{5.16} $$

Therefore, the desired Taylor polynomial is

$$ T_6(x) = \sum_{k=0}^{6} \frac{f^{(k)}(x_0)}{k!} (x - x_0)^k \tag{5.17a} $$
$$ = 1 + 4(x - 1) + 6(x - 1)^2 + 4(x - 1)^3 + (x - 1)^4 + 0 . \tag{5.17b} $$

Multiplying out and re-arranging yields

$$ T_6(x) = (1 - 4 + 6 - 4 + 1) + x(4 - 12 + 12 - 4) + x^2(6 - 12 + 6) + x^3(4 - 4) + x^4 \tag{5.18a} $$
$$ = x^4 = f(x) , \tag{5.18b} $$

i.e., we obtain an exact representation of the original function.
Note: $f \in C^\infty$ means that $f$ is continuously differentiable
infinitely many times.

Figure 5.4 Taylor polynomials. The original function
$f(x) = \sin(x) + \cos(x)$ (black, solid) is approximated by Taylor
polynomials (dashed) around $x_0 = 0$. Higher-order Taylor polynomials
approximate the function $f$ better and more globally. $T_{10}$ is already
similar to $f$ in $[-4, 4]$.


Example 5.4 (Taylor Series)
Consider the function in Figure 5.4 given by

$$ f(x) = \sin(x) + \cos(x) \in C^\infty . \tag{5.19} $$

We seek a Taylor series expansion of $f$ at $x_0 = 0$, which is the Maclaurin
series expansion of $f$. We obtain the following derivatives:

$$ f(0) = \sin(0) + \cos(0) = 1 \tag{5.20} $$
$$ f'(0) = \cos(0) - \sin(0) = 1 \tag{5.21} $$
$$ f''(0) = -\sin(0) - \cos(0) = -1 \tag{5.22} $$
$$ f^{(3)}(0) = -\cos(0) + \sin(0) = -1 \tag{5.23} $$
$$ f^{(4)}(0) = \sin(0) + \cos(0) = f(0) = 1 \tag{5.24} $$

We can see a pattern here: The coefficients in our Taylor series are only
$\pm 1$ (since $\sin(0) = 0$), each of which occurs twice before switching to
the other one. Furthermore, $f^{(k+4)}(0) = f^{(k)}(0)$.

Therefore, the full Taylor series expansion of $f$ at $x_0 = 0$ is given by

$$ T_\infty(x) = \sum_{k=0}^{\infty} \frac{f^{(k)}(x_0)}{k!} (x - x_0)^k \tag{5.25a} $$
$$ = 1 + x - \frac{1}{2!} x^2 - \frac{1}{3!} x^3 + \frac{1}{4!} x^4 + \frac{1}{5!} x^5 - \cdots \tag{5.25b} $$
$$ = 1 - \frac{1}{2!} x^2 + \frac{1}{4!} x^4 \mp \cdots + x - \frac{1}{3!} x^3 + \frac{1}{5!} x^5 \mp \cdots \tag{5.25c} $$
$$ = \sum_{k=0}^{\infty} (-1)^k \frac{1}{(2k)!} x^{2k} + \sum_{k=0}^{\infty} (-1)^k \frac{1}{(2k+1)!} x^{2k+1} \tag{5.25d} $$
$$ = \cos(x) + \sin(x) , \tag{5.25e} $$

where we used the power series representations

$$ \cos(x) = \sum_{k=0}^{\infty} (-1)^k \frac{1}{(2k)!} x^{2k} , \tag{5.26} $$
$$ \sin(x) = \sum_{k=0}^{\infty} (-1)^k \frac{1}{(2k+1)!} x^{2k+1} . \tag{5.27} $$

Figure 5.4 shows the corresponding first Taylor polynomials $T_n$ for
$n = 0, 1, 5, 10$.
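To make the statement about Figure 5.4 concrete, the following sketch evaluates the Maclaurin polynomials $T_n$ of $\sin(x) + \cos(x)$ and measures their error at a single point. The function names and the evaluation point x = 2 are illustrative assumptions; the derivative pattern 1, 1, -1, -1 is the one observed in (5.20) to (5.24).

```python
import math


def kth_derivative_at_zero(k):
    """f^(k)(0) for f = sin + cos; the values cycle through 1, 1, -1, -1."""
    return (1, 1, -1, -1)[k % 4]


def taylor_polynomial(x, n):
    """Evaluate the degree-n Maclaurin polynomial T_n(x) from (5.7) with x_0 = 0."""
    return sum(kth_derivative_at_zero(k) / math.factorial(k) * x ** k
               for k in range(n + 1))


def f(x):
    return math.sin(x) + math.cos(x)


x = 2.0  # illustrative evaluation point (assumption)
errors = {n: abs(taylor_polynomial(x, n) - f(x)) for n in (0, 1, 5, 10)}
for n, err in errors.items():
    print(f"n = {n:2d}: |T_n(x) - f(x)| = {err:.2e}")
```

At this point the degree-10 polynomial is already very close to $f$, while the low-degree polynomials are only accurate near $x_0 = 0$.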


Remark. A Taylor series is a special case of a power series

$$ f(x) = \sum_{k=0}^{\infty} a_k (x - c)^k , \tag{5.28} $$

where $a_k$ are coefficients and $c$ is a constant, which has the special form
in Definition 5.4. ♢


5.1.2 Differentiation Rules 


In the following, we briefly state basic differentiation rules, where we 
denote the derivative of f by f’. 


Product rule: $(f(x) g(x))' = f'(x) g(x) + f(x) g'(x)$ (5.29)

Quotient rule: $\left( \frac{f(x)}{g(x)} \right)' = \frac{f'(x) g(x) - f(x) g'(x)}{(g(x))^2}$ (5.30)

Sum rule: $(f(x) + g(x))' = f'(x) + g'(x)$ (5.31)

Chain rule: $(g(f(x)))' = (g \circ f)'(x) = g'(f(x)) f'(x)$ (5.32)


Here, $g \circ f$ denotes function composition $x \mapsto f(x) \mapsto g(f(x))$.


Example 5.5 (Chain Rule)
Let us compute the derivative of the function $h(x) = (2x + 1)^4$ using the
chain rule. With

$$ h(x) = (2x + 1)^4 = g(f(x)) , \tag{5.33} $$
$$ f(x) = 2x + 1 , \tag{5.34} $$
$$ g(f) = f^4 , \tag{5.35} $$

we obtain the derivatives of $f$ and $g$ as

$$ f'(x) = 2 , \tag{5.36} $$
$$ g'(f) = 4 f^3 , \tag{5.37} $$

such that the derivative of $h$ is given as

$$ h'(x) = g'(f) f'(x) = (4 f^3) \cdot 2 = 4 (2x + 1)^3 \cdot 2 = 8 (2x + 1)^3 , \tag{5.38} $$

where we used the chain rule (5.32) and substituted the definition of $f$
in (5.34) in $g'(f)$.


5.2 Partial Differentiation and Gradients 


Differentiation as discussed in Section 5.1 applies to functions $f$ of a
scalar variable $x \in \mathbb{R}$. In the following, we consider the general
case where the function $f$ depends on one or more variables
$\boldsymbol{x} \in \mathbb{R}^n$, e.g., $f(\boldsymbol{x}) = f(x_1, x_2)$. The
generalization of the derivative to functions of several variables is the
gradient.

We find the gradient of the function f with respect to x by varying one 
variable at a time and keeping the others constant. The gradient is then 
the collection of these partial derivatives. 


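Definition 5.5 (Partial Derivative). For a function
$f : \mathbb{R}^n \to \mathbb{R}$, $\boldsymbol{x} \mapsto f(\boldsymbol{x})$,
$\boldsymbol{x} \in \mathbb{R}^n$ of $n$ variables $x_1, \ldots, x_n$ we define
the partial derivatives as

$$ \frac{\partial f}{\partial x_1} = \lim_{h \to 0} \frac{f(x_1 + h, x_2, \ldots, x_n) - f(\boldsymbol{x})}{h} $$
$$ \vdots \tag{5.39} $$
$$ \frac{\partial f}{\partial x_n} = \lim_{h \to 0} \frac{f(x_1, \ldots, x_{n-1}, x_n + h) - f(\boldsymbol{x})}{h} $$

and collect them in the row vector

$$ \nabla_{\boldsymbol{x}} f = \operatorname{grad} f = \frac{df}{d\boldsymbol{x}} = \begin{bmatrix} \frac{\partial f(\boldsymbol{x})}{\partial x_1} & \frac{\partial f(\boldsymbol{x})}{\partial x_2} & \cdots & \frac{\partial f(\boldsymbol{x})}{\partial x_n} \end{bmatrix} \in \mathbb{R}^{1 \times n} , \tag{5.40} $$

where $n$ is the number of variables and 1 is the dimension of the image/
range/codomain of $f$. Here, we defined the column vector
$\boldsymbol{x} = [x_1, \ldots, x_n]^\top \in \mathbb{R}^n$. The row vector in
(5.40) is called the gradient of $f$ or the Jacobian and is the generalization
of the derivative from Section 5.1.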


Remark. This definition of the Jacobian is a special case of the general
definition of the Jacobian for vector-valued functions as the collection of
partial derivatives. We will get back to this in Section 5.3. ♢


Example 5.6 (Partial Derivatives Using the Chain Rule)
We can use results from scalar differentiation: Each partial derivative is
a derivative with respect to a scalar. For $f(x, y) = (x + 2y^3)^2$, we obtain
the partial derivatives

$$ \frac{\partial f(x, y)}{\partial x} = 2(x + 2y^3) \frac{\partial}{\partial x}(x + 2y^3) = 2(x + 2y^3) , \tag{5.41} $$

Draft (2021-01-14) of “Mathematics for Machine Learning”. Feedback: https://mml-book.com.




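$$ \frac{\partial f(x, y)}{\partial y} = 2(x + 2y^3) \frac{\partial}{\partial y}(x + 2y^3) = 12 (x + 2y^3) y^2 , \tag{5.42} $$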


where we used the chain rule (5.32) to compute the partial derivatives. 


Remark (Gradient as a Row Vector). It is not uncommon in the literature 
to define the gradient vector as a column vector, following the conven- 
tion that vectors are generally column vectors. The reason why we define 
the gradient vector as a row vector is twofold: First, we can consistently 
generalize the gradient to vector-valued functions
$\boldsymbol{f} : \mathbb{R}^n \to \mathbb{R}^m$ (then
the gradient becomes a matrix). Second, we can immediately apply the
multi-variate chain rule without paying attention to the dimension of the
gradient. We will discuss both points in Section 5.3. ♢


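Example 5.7 (Gradient)
For $f(x_1, x_2) = x_1^2 x_2 + x_1 x_2^3 \in \mathbb{R}$, the partial
derivatives (i.e., the derivatives of $f$ with respect to $x_1$ and $x_2$) are

$$ \frac{\partial f(x_1, x_2)}{\partial x_1} = 2 x_1 x_2 + x_2^3 \tag{5.43} $$
$$ \frac{\partial f(x_1, x_2)}{\partial x_2} = x_1^2 + 3 x_1 x_2^2 \tag{5.44} $$

and the gradient is then

$$ \frac{df}{d\boldsymbol{x}} = \begin{bmatrix} \frac{\partial f(x_1, x_2)}{\partial x_1} & \frac{\partial f(x_1, x_2)}{\partial x_2} \end{bmatrix} = \begin{bmatrix} 2 x_1 x_2 + x_2^3 & x_1^2 + 3 x_1 x_2^2 \end{bmatrix} \in \mathbb{R}^{1 \times 2} . \tag{5.45} $$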








5.2.1 Basic Rules of Partial Differentiation 


In the multivariate case, where $\boldsymbol{x} \in \mathbb{R}^n$, the basic
differentiation rules that we know from school (e.g., sum rule, product rule,
chain rule; see also Section 5.1.2) still apply. However, when we compute
derivatives with respect to vectors $\boldsymbol{x} \in \mathbb{R}^n$ we need
to pay attention: Our gradients now involve vectors and matrices, and matrix
multiplication is not commutative (Section 2.2.1), i.e., the order matters.

Here are the general product rule, sum rule, and chain rule:

Product rule: $\frac{\partial}{\partial \boldsymbol{x}} \big( f(\boldsymbol{x}) g(\boldsymbol{x}) \big) = \frac{\partial f}{\partial \boldsymbol{x}} g(\boldsymbol{x}) + f(\boldsymbol{x}) \frac{\partial g}{\partial \boldsymbol{x}}$ (5.46)

Sum rule: $\frac{\partial}{\partial \boldsymbol{x}} \big( f(\boldsymbol{x}) + g(\boldsymbol{x}) \big) = \frac{\partial f}{\partial \boldsymbol{x}} + \frac{\partial g}{\partial \boldsymbol{x}}$ (5.47)

Chain rule: $\frac{\partial}{\partial \boldsymbol{x}} (g \circ f)(\boldsymbol{x}) = \frac{\partial}{\partial \boldsymbol{x}} \big( g(f(\boldsymbol{x})) \big) = \frac{\partial g}{\partial f} \frac{\partial f}{\partial \boldsymbol{x}}$ (5.48)

For comparison, the univariate rules read: product rule $(fg)' = f'g + fg'$,
sum rule $(f + g)' = f' + g'$, chain rule $(g(f))' = g'(f) f'$.


Let us have a closer look at the chain rule. The chain rule (5.48) resembles
to some degree the rules for matrix multiplication where we said that
neighboring dimensions have to match for matrix multiplication to be
defined; see Section 2.2.1. If we go from left to right, the chain rule
exhibits similar properties: $\partial f$ shows up in the “denominator” of the
first factor and in the “numerator” of the second factor. If we multiply the
factors together, multiplication is defined, i.e., the dimensions of
$\partial f$ match, and $\partial f$ “cancels”, such that
$\partial g / \partial \boldsymbol{x}$ remains. This is only an intuition, but
not mathematically correct since the partial derivative is not a fraction.


5.2.2 Chain Rule 


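Consider a function $f : \mathbb{R}^2 \to \mathbb{R}$ of two variables
$x_1, x_2$. Furthermore, $x_1(t)$ and $x_2(t)$ are themselves functions of $t$.
To compute the gradient of $f$ with respect to $t$, we need to apply the chain
rule (5.48) for multivariate functions as

$$ \frac{df}{dt} = \begin{bmatrix} \frac{\partial f}{\partial x_1} & \frac{\partial f}{\partial x_2} \end{bmatrix} \begin{bmatrix} \frac{\partial x_1(t)}{\partial t} \\ \frac{\partial x_2(t)}{\partial t} \end{bmatrix} = \frac{\partial f}{\partial x_1} \frac{\partial x_1}{\partial t} + \frac{\partial f}{\partial x_2} \frac{\partial x_2}{\partial t} , \tag{5.49} $$

where $d$ denotes the gradient and $\partial$ partial derivatives.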


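Example 5.8
Consider $f(x_1, x_2) = x_1^2 + 2 x_2$, where $x_1 = \sin t$ and
$x_2 = \cos t$, then

$$ \frac{df}{dt} = \frac{\partial f}{\partial x_1} \frac{\partial x_1}{\partial t} + \frac{\partial f}{\partial x_2} \frac{\partial x_2}{\partial t} \tag{5.50a} $$
$$ = 2 \sin t \, \frac{\partial \sin t}{\partial t} + 2 \, \frac{\partial \cos t}{\partial t} \tag{5.50b} $$
$$ = 2 \sin t \cos t - 2 \sin t = 2 \sin t (\cos t - 1) \tag{5.50c} $$

is the corresponding derivative of $f$ with respect to $t$.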


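If $f(x_1, x_2)$ is a function of $x_1$ and $x_2$, where $x_1(s, t)$ and
$x_2(s, t)$ are themselves functions of two variables $s$ and $t$, the chain
rule yields the partial derivatives

$$ \frac{\partial f}{\partial s} = \frac{\partial f}{\partial x_1} \frac{\partial x_1}{\partial s} + \frac{\partial f}{\partial x_2} \frac{\partial x_2}{\partial s} , \tag{5.51} $$
$$ \frac{\partial f}{\partial t} = \frac{\partial f}{\partial x_1} \frac{\partial x_1}{\partial t} + \frac{\partial f}{\partial x_2} \frac{\partial x_2}{\partial t} , \tag{5.52} $$

and the gradient is obtained by the matrix multiplication

$$ \frac{df}{d(s, t)} = \frac{\partial f}{\partial \boldsymbol{x}} \frac{\partial \boldsymbol{x}}{\partial (s, t)} = \underbrace{\begin{bmatrix} \frac{\partial f}{\partial x_1} & \frac{\partial f}{\partial x_2} \end{bmatrix}}_{= \frac{\partial f}{\partial \boldsymbol{x}}} \underbrace{\begin{bmatrix} \frac{\partial x_1}{\partial s} & \frac{\partial x_1}{\partial t} \\ \frac{\partial x_2}{\partial s} & \frac{\partial x_2}{\partial t} \end{bmatrix}}_{= \frac{\partial \boldsymbol{x}}{\partial (s, t)}} . \tag{5.53} $$
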

This compact way of writing the chain rule as a matrix multiplication only 
makes sense if the gradient is defined as a row vector. Otherwise, we will 
need to start transposing gradients for the matrix dimensions to match. 
This may still be straightforward as long as the gradient is a vector or a 
matrix; however, when the gradient becomes a tensor (we will discuss this 
in the following), the transpose is no longer a triviality. 





Remark (Verifying the Correctness of a Gradient Implementation). The
definition of the partial derivatives as the limit of the corresponding
difference quotient (see (5.39)) can be exploited when numerically checking
the correctness of gradients in computer programs: When we compute
gradients and implement them, we can use finite differences to numerically
test our computation and implementation: We choose the value $h$
to be small (e.g., $h = 10^{-4}$) and compare the finite-difference
approximation from (5.39) with our (analytic) implementation of the gradient.
If the error is small, our gradient implementation is probably correct.
“Small” could mean that

$$ \sqrt{\frac{\sum_i (dh_i - df_i)^2}{\sum_i (dh_i + df_i)^2}} < 10^{-6} , $$

where $dh_i$ is the finite-difference approximation and $df_i$ is the analytic
gradient of $f$ with respect to the $i$th variable $x_i$. ♢
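The check described in the remark can be sketched in a few lines. This is an illustrative sketch, not the book's implementation: the helper names are made up, the test function is the one from Example 5.7, and a central difference is used instead of the one-sided quotient in (5.39). That substitution is an assumption made here so that the truncation error is small enough for the $10^{-6}$ threshold at $h = 10^{-4}$.

```python
import math


def f(x):
    """f(x1, x2) = x1^2 x2 + x1 x2^3 from Example 5.7."""
    x1, x2 = x
    return x1 ** 2 * x2 + x1 * x2 ** 3


def analytic_gradient(x):
    """Hand-derived gradient (5.45), the implementation under test."""
    x1, x2 = x
    return [2 * x1 * x2 + x2 ** 3, x1 ** 2 + 3 * x1 * x2 ** 2]


def finite_difference_gradient(f, x, h=1e-4):
    """Central differences, perturbing one coordinate at a time."""
    grad = []
    for i in range(len(x)):
        up, down = list(x), list(x)
        up[i] += h
        down[i] -= h
        grad.append((f(up) - f(down)) / (2 * h))
    return grad


def relative_error(dh, df):
    """Error measure from the remark: sqrt(sum (dh-df)^2 / sum (dh+df)^2)."""
    num = sum((a - b) ** 2 for a, b in zip(dh, df))
    den = sum((a + b) ** 2 for a, b in zip(dh, df))
    return math.sqrt(num / den)


x = [1.5, -0.5]  # illustrative test point (assumption)
err = relative_error(finite_difference_gradient(f, x), analytic_gradient(x))
print(f"relative error = {err:.2e}")
```

With the one-sided quotient the error would be of order $h$, so the threshold would have to be loosened or $h$ reduced accordingly.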


5.3 Gradients of Vector-Valued Functions 


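Thus far, we discussed partial derivatives and gradients of functions
$f : \mathbb{R}^n \to \mathbb{R}$ mapping to the real numbers. In the
following, we will generalize the concept of the gradient to vector-valued
functions (vector fields) $\boldsymbol{f} : \mathbb{R}^n \to \mathbb{R}^m$,
where $n \geq 1$ and $m > 1$.


For a function $\boldsymbol{f} : \mathbb{R}^n \to \mathbb{R}^m$ and a vector
$\boldsymbol{x} = [x_1, \ldots, x_n]^\top \in \mathbb{R}^n$, the
corresponding vector of function values is given as

$$ \boldsymbol{f}(\boldsymbol{x}) = \begin{bmatrix} f_1(\boldsymbol{x}) \\ \vdots \\ f_m(\boldsymbol{x}) \end{bmatrix} \in \mathbb{R}^m . \tag{5.54} $$

Writing the vector-valued function in this way allows us to view a
vector-valued function $\boldsymbol{f} : \mathbb{R}^n \to \mathbb{R}^m$ as a
vector of functions $[f_1, \ldots, f_m]^\top$,
$f_i : \mathbb{R}^n \to \mathbb{R}$ that map onto $\mathbb{R}$. The
differentiation rules for every $f_i$ are exactly the ones we discussed in
Section 5.2.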
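Therefore, the partial derivative of a vector-valued function
$\boldsymbol{f} : \mathbb{R}^n \to \mathbb{R}^m$ with respect to
$x_i \in \mathbb{R}$, $i = 1, \ldots, n$, is given as the vector

$$ \frac{\partial \boldsymbol{f}}{\partial x_i} = \begin{bmatrix} \frac{\partial f_1}{\partial x_i} \\ \vdots \\ \frac{\partial f_m}{\partial x_i} \end{bmatrix} = \begin{bmatrix} \lim_{h \to 0} \frac{f_1(x_1, \ldots, x_{i-1}, x_i + h, x_{i+1}, \ldots, x_n) - f_1(\boldsymbol{x})}{h} \\ \vdots \\ \lim_{h \to 0} \frac{f_m(x_1, \ldots, x_{i-1}, x_i + h, x_{i+1}, \ldots, x_n) - f_m(\boldsymbol{x})}{h} \end{bmatrix} \in \mathbb{R}^m . \tag{5.55} $$

From (5.40), we know that the gradient of $\boldsymbol{f}$ with respect to a
vector is the row vector of the partial derivatives. In (5.55), every partial
derivative $\partial \boldsymbol{f} / \partial x_i$ is itself a column vector.
Therefore, we obtain the gradient of
$\boldsymbol{f} : \mathbb{R}^n \to \mathbb{R}^m$ with respect to
$\boldsymbol{x} \in \mathbb{R}^n$ by collecting these partial derivatives:

$$ \frac{d\boldsymbol{f}(\boldsymbol{x})}{d\boldsymbol{x}} = \begin{bmatrix} \frac{\partial \boldsymbol{f}(\boldsymbol{x})}{\partial x_1} & \cdots & \frac{\partial \boldsymbol{f}(\boldsymbol{x})}{\partial x_n} \end{bmatrix} \tag{5.56a} $$
$$ = \begin{bmatrix} \frac{\partial f_1(\boldsymbol{x})}{\partial x_1} & \cdots & \frac{\partial f_1(\boldsymbol{x})}{\partial x_n} \\ \vdots & & \vdots \\ \frac{\partial f_m(\boldsymbol{x})}{\partial x_1} & \cdots & \frac{\partial f_m(\boldsymbol{x})}{\partial x_n} \end{bmatrix} \in \mathbb{R}^{m \times n} . \tag{5.56b} $$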





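Definition 5.6 (Jacobian). The collection of all first-order partial
derivatives of a vector-valued function
$\boldsymbol{f} : \mathbb{R}^n \to \mathbb{R}^m$ is called the Jacobian. The
Jacobian $\boldsymbol{J}$ is an $m \times n$ matrix, which we define and
arrange as follows:

$$ \boldsymbol{J} = \nabla_{\boldsymbol{x}} \boldsymbol{f} = \frac{d\boldsymbol{f}(\boldsymbol{x})}{d\boldsymbol{x}} = \begin{bmatrix} \frac{\partial \boldsymbol{f}(\boldsymbol{x})}{\partial x_1} & \cdots & \frac{\partial \boldsymbol{f}(\boldsymbol{x})}{\partial x_n} \end{bmatrix} \tag{5.57} $$
$$ = \begin{bmatrix} \frac{\partial f_1(\boldsymbol{x})}{\partial x_1} & \cdots & \frac{\partial f_1(\boldsymbol{x})}{\partial x_n} \\ \vdots & & \vdots \\ \frac{\partial f_m(\boldsymbol{x})}{\partial x_1} & \cdots & \frac{\partial f_m(\boldsymbol{x})}{\partial x_n} \end{bmatrix} , \tag{5.58} $$
$$ \boldsymbol{x} = \begin{bmatrix} x_1 \\ \vdots \\ x_n \end{bmatrix} , \quad J(i, j) = \frac{\partial f_i}{\partial x_j} . \tag{5.59} $$

As a special case of (5.58), a function $f : \mathbb{R}^n \to \mathbb{R}^1$,
which maps a vector $\boldsymbol{x} \in \mathbb{R}^n$ onto a scalar (e.g.,
$f(\boldsymbol{x}) = \sum_{i=1}^{n} x_i$), possesses a Jacobian that is a row
vector (matrix of dimension $1 \times n$); see (5.40).


Remark. In this book, we use the numerator layout of the derivative, i.e.,
the derivative $d\boldsymbol{f} / d\boldsymbol{x}$ of
$\boldsymbol{f} \in \mathbb{R}^m$ with respect to
$\boldsymbol{x} \in \mathbb{R}^n$ is an $m \times n$ matrix, where the elements
of $\boldsymbol{f}$ define the rows and the elements of $\boldsymbol{x}$
define the columns of the corresponding Jacobian; see (5.58). There also
exists the denominator layout, which is the transpose of the numerator
layout. ♢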


We will see how the Jacobian is used in the change-of-variable method 
for probability distributions in Section 6.7. The amount of scaling due to 
the transformation of a variable is provided by the determinant. 

In Section 4.1, we saw that the determinant can be used to compute
the area of a parallelogram. If we are given two vectors
$\boldsymbol{b}_1 = [1, 0]^\top$, $\boldsymbol{b}_2 = [0, 1]^\top$ as the sides
of the unit square (blue; see Figure 5.5), the area of this square is

$$ \left| \det \left( \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix} \right) \right| = 1 . \tag{5.60} $$

If we take a parallelogram with the sides $\boldsymbol{c}_1 = [-2, 1]^\top$,
$\boldsymbol{c}_2 = [1, 1]^\top$ (orange in Figure 5.5), its area is given as
the absolute value of the determinant (see Section 4.1)

$$ \left| \det \left( \begin{bmatrix} -2 & 1 \\ 1 & 1 \end{bmatrix} \right) \right| = |-3| = 3 , \tag{5.61} $$

i.e., the area of this is exactly three times the area of the unit square.
We can find this scaling factor by finding a mapping that transforms the
unit square into the parallelogram. In linear algebra terms, we effectively
perform a variable transformation from $(\boldsymbol{b}_1, \boldsymbol{b}_2)$
to $(\boldsymbol{c}_1, \boldsymbol{c}_2)$. In our case, the mapping is linear
and the absolute value of the determinant of this mapping gives us exactly
the scaling factor we are looking for.

We will describe two approaches to identify this mapping. First, we ex- 
ploit that the mapping is linear so that we can use the tools from Chapter 2 
to identify this mapping. Second, we will find the mapping using partial 
derivatives using the tools we have been discussing in this chapter. 

Approach 1 To get started with the linear algebra approach, we
identify both $\{\boldsymbol{b}_1, \boldsymbol{b}_2\}$ and
$\{\boldsymbol{c}_1, \boldsymbol{c}_2\}$ as bases of $\mathbb{R}^2$ (see
Section 2.6.1 for a recap). What we effectively perform is a change of basis
from $(\boldsymbol{b}_1, \boldsymbol{b}_2)$ to
$(\boldsymbol{c}_1, \boldsymbol{c}_2)$, and we are looking for the
transformation matrix that implements the basis change. Using results from
Section 2.7.2, we identify the desired basis change matrix as

$$ \boldsymbol{J} = \begin{bmatrix} -2 & 1 \\ 1 & 1 \end{bmatrix} , \tag{5.62} $$

such that $\boldsymbol{J} \boldsymbol{b}_1 = \boldsymbol{c}_1$ and
$\boldsymbol{J} \boldsymbol{b}_2 = \boldsymbol{c}_2$. The absolute value of the
determinant of $\boldsymbol{J}$, which yields the scaling factor we are
looking for, is given as $|\det(\boldsymbol{J})| = 3$, i.e., the area of the
parallelogram spanned by $(\boldsymbol{c}_1, \boldsymbol{c}_2)$ is three times
greater than the area spanned by $(\boldsymbol{b}_1, \boldsymbol{b}_2)$.
Geometrically, the Jacobian determinant gives the magnification/scaling
factor when we transform an area or volume.

Figure 5.5 The determinant of the Jacobian of $\boldsymbol{f}$ can be used to
compute the magnifier between the blue and orange area.

Figure 5.6 Dimensionality of (partial) derivatives.

Approach 2 The linear algebra approach works for linear
transformations; for nonlinear transformations (which become relevant in
Section 6.7), we follow a more general approach using partial derivatives.

For this approach, we consider a function
$\boldsymbol{f} : \mathbb{R}^2 \to \mathbb{R}^2$ that performs a variable
transformation. In our example, $\boldsymbol{f}$ maps the coordinate
representation of any vector $\boldsymbol{x} \in \mathbb{R}^2$ with respect to
$(\boldsymbol{b}_1, \boldsymbol{b}_2)$ onto the coordinate representation
$\boldsymbol{y} \in \mathbb{R}^2$ with respect to
$(\boldsymbol{c}_1, \boldsymbol{c}_2)$. We want to identify the mapping so that
we can compute how an area (or volume) changes when it is being transformed
by $\boldsymbol{f}$. For this, we need to find out how
$\boldsymbol{f}(\boldsymbol{x})$ changes if we modify $\boldsymbol{x}$ a bit.
This question is exactly answered by the Jacobian matrix
$\frac{d\boldsymbol{f}}{d\boldsymbol{x}} \in \mathbb{R}^{2 \times 2}$. Since we
can write

$$ y_1 = -2 x_1 + x_2 \tag{5.63} $$
$$ y_2 = x_1 + x_2 \tag{5.64} $$

we obtain the functional relationship between $\boldsymbol{x}$ and
$\boldsymbol{y}$, which allows us to get the partial derivatives

$$ \frac{\partial y_1}{\partial x_1} = -2 , \quad \frac{\partial y_1}{\partial x_2} = 1 , \quad \frac{\partial y_2}{\partial x_1} = 1 , \quad \frac{\partial y_2}{\partial x_2} = 1 \tag{5.65} $$

and compose the Jacobian as

$$ \boldsymbol{J} = \begin{bmatrix} \frac{\partial y_1}{\partial x_1} & \frac{\partial y_1}{\partial x_2} \\ \frac{\partial y_2}{\partial x_1} & \frac{\partial y_2}{\partial x_2} \end{bmatrix} = \begin{bmatrix} -2 & 1 \\ 1 & 1 \end{bmatrix} . \tag{5.66} $$

The Jacobian represents the coordinate transformation we are looking
for. It is exact if the coordinate transformation is linear (as in our case),
and (5.66) recovers exactly the basis change matrix in (5.62). If the
coordinate transformation is nonlinear, the Jacobian approximates this
nonlinear transformation locally with a linear one. The absolute value of the
Jacobian determinant $|\det(\boldsymbol{J})|$ is the factor by which areas or
volumes are scaled when coordinates are transformed. Our case yields
$|\det(\boldsymbol{J})| = 3$.
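Approach 2 can be imitated numerically. The sketch below is an illustration under stated assumptions (the helper `numerical_jacobian`, the step size, and the evaluation point are not from the text): it approximates the Jacobian of the map (5.63) and (5.64) by finite differences in numerator layout and then takes the absolute determinant.

```python
def f(x):
    """Linear map from (5.63)-(5.64): y1 = -2 x1 + x2, y2 = x1 + x2."""
    x1, x2 = x
    return [-2 * x1 + x2, x1 + x2]


def numerical_jacobian(f, x, h=1e-6):
    """J[i][j] approximates d f_i / d x_j (rows: outputs, columns: inputs)."""
    fx = f(x)
    J = [[0.0] * len(x) for _ in fx]
    for j in range(len(x)):
        shifted = list(x)
        shifted[j] += h
        f_shifted = f(shifted)
        for i in range(len(fx)):
            J[i][j] = (f_shifted[i] - fx[i]) / h
    return J


J = numerical_jacobian(f, [0.3, -1.2])   # evaluation point is arbitrary here
det_J = J[0][0] * J[1][1] - J[0][1] * J[1][0]
scaling = abs(det_J)
print(J, scaling)
```

Because the map is linear, the Jacobian is the same at every point; for a nonlinear map the same routine would return a different local approximation at each $\boldsymbol{x}$.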

The Jacobian determinant and variable transformations will become 
relevant in Section 6.7 when we transform random variables and prob- 
ability distributions. These transformations are extremely relevant in ma- 
chine learning in the context of training deep neural networks using the 
reparametrization trick, also called infinite perturbation analysis. 

In this chapter, we encountered derivatives of functions. Figure 5.6
summarizes the dimensions of those derivatives. If
$f : \mathbb{R} \to \mathbb{R}$ the gradient is simply a scalar (top-left
entry). For $f : \mathbb{R}^D \to \mathbb{R}$ the gradient is a $1 \times D$
row vector (top-right entry). For
$\boldsymbol{f} : \mathbb{R} \to \mathbb{R}^E$, the gradient is an
$E \times 1$ column vector, and for
$\boldsymbol{f} : \mathbb{R}^D \to \mathbb{R}^E$ the gradient is an
$E \times D$ matrix.
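Example 5.9 (Gradient of a Vector-Valued Function)
We are given

$$ \boldsymbol{f}(\boldsymbol{x}) = \boldsymbol{A} \boldsymbol{x} , \quad \boldsymbol{f}(\boldsymbol{x}) \in \mathbb{R}^M , \quad \boldsymbol{A} \in \mathbb{R}^{M \times N} , \quad \boldsymbol{x} \in \mathbb{R}^N . $$

To compute the gradient $d\boldsymbol{f} / d\boldsymbol{x}$ we first determine
the dimension of $d\boldsymbol{f} / d\boldsymbol{x}$: Since
$\boldsymbol{f} : \mathbb{R}^N \to \mathbb{R}^M$, it follows that
$d\boldsymbol{f} / d\boldsymbol{x} \in \mathbb{R}^{M \times N}$. Second,
to compute the gradient we determine the partial derivatives of
$\boldsymbol{f}$ with respect to every $x_j$:

$$ f_i(\boldsymbol{x}) = \sum_{j=1}^{N} A_{ij} x_j \implies \frac{\partial f_i}{\partial x_j} = A_{ij} . \tag{5.67} $$

We collect the partial derivatives in the Jacobian and obtain the gradient

$$ \frac{d\boldsymbol{f}}{d\boldsymbol{x}} = \begin{bmatrix} \frac{\partial f_1}{\partial x_1} & \cdots & \frac{\partial f_1}{\partial x_N} \\ \vdots & & \vdots \\ \frac{\partial f_M}{\partial x_1} & \cdots & \frac{\partial f_M}{\partial x_N} \end{bmatrix} = \begin{bmatrix} A_{11} & \cdots & A_{1N} \\ \vdots & & \vdots \\ A_{M1} & \cdots & A_{MN} \end{bmatrix} = \boldsymbol{A} \in \mathbb{R}^{M \times N} . \tag{5.68} $$
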

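Example 5.10 (Chain Rule)
Consider the function $h : \mathbb{R} \to \mathbb{R}$,
$h(t) = (f \circ \boldsymbol{g})(t)$ with

$$ f : \mathbb{R}^2 \to \mathbb{R} \tag{5.69} $$
$$ \boldsymbol{g} : \mathbb{R} \to \mathbb{R}^2 \tag{5.70} $$
$$ f(\boldsymbol{x}) = \exp(x_1 x_2^2) , \tag{5.71} $$
$$ \boldsymbol{x} = \begin{bmatrix} x_1 \\ x_2 \end{bmatrix} = \boldsymbol{g}(t) = \begin{bmatrix} t \cos t \\ t \sin t \end{bmatrix} \tag{5.72} $$

and compute the gradient of $h$ with respect to $t$. Since
$f : \mathbb{R}^2 \to \mathbb{R}$ and
$\boldsymbol{g} : \mathbb{R} \to \mathbb{R}^2$ we note that

$$ \frac{\partial f}{\partial \boldsymbol{x}} \in \mathbb{R}^{1 \times 2} , \quad \frac{\partial \boldsymbol{g}}{\partial t} \in \mathbb{R}^{2 \times 1} . \tag{5.73} $$

The desired gradient is computed by applying the chain rule:

$$ \frac{dh}{dt} = \frac{\partial f}{\partial \boldsymbol{x}} \frac{\partial \boldsymbol{x}}{\partial t} = \begin{bmatrix} \frac{\partial f}{\partial x_1} & \frac{\partial f}{\partial x_2} \end{bmatrix} \begin{bmatrix} \frac{\partial x_1}{\partial t} \\ \frac{\partial x_2}{\partial t} \end{bmatrix} \tag{5.74a} $$
$$ = \begin{bmatrix} \exp(x_1 x_2^2) x_2^2 & 2 \exp(x_1 x_2^2) x_1 x_2 \end{bmatrix} \begin{bmatrix} \cos t - t \sin t \\ \sin t + t \cos t \end{bmatrix} \tag{5.74b} $$
$$ = \exp(x_1 x_2^2) \big( x_2^2 (\cos t - t \sin t) + 2 x_1 x_2 (\sin t + t \cos t) \big) , \tag{5.74c} $$

where $x_1 = t \cos t$ and $x_2 = t \sin t$; see (5.72).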
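The result (5.74c) can be cross-checked against a finite-difference estimate of $dh/dt$. The sketch below is illustrative only; the function names, the point t = 0.7, and the central-difference step are assumptions made for this check.

```python
import math


def g(t):
    """Inner function (5.72): x = (t cos t, t sin t)."""
    return (t * math.cos(t), t * math.sin(t))


def f(x1, x2):
    """Outer function (5.71)."""
    return math.exp(x1 * x2 ** 2)


def h(t):
    return f(*g(t))


def dh_dt(t):
    """Row vector df/dx times column vector dx/dt, as in (5.74a)-(5.74b)."""
    x1, x2 = g(t)
    df_dx = (math.exp(x1 * x2 ** 2) * x2 ** 2,
             2 * math.exp(x1 * x2 ** 2) * x1 * x2)
    dx_dt = (math.cos(t) - t * math.sin(t),
             math.sin(t) + t * math.cos(t))
    return df_dx[0] * dx_dt[0] + df_dx[1] * dx_dt[1]


t = 0.7       # illustrative evaluation point (assumption)
eps = 1e-6
numeric = (h(t + eps) - h(t - eps)) / (2 * eps)
analytic = dh_dt(t)
print(numeric, analytic)
```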
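Example 5.11 (Gradient of a Least-Squares Loss in a Linear Model)
Let us consider the linear model

$$ \boldsymbol{y} = \boldsymbol{\Phi} \boldsymbol{\theta} , \tag{5.75} $$

where $\boldsymbol{\theta} \in \mathbb{R}^D$ is a parameter vector,
$\boldsymbol{\Phi} \in \mathbb{R}^{N \times D}$ are input features and
$\boldsymbol{y} \in \mathbb{R}^N$ are the corresponding observations. We
define the functions

$$ L(\boldsymbol{e}) := \|\boldsymbol{e}\|^2 , \tag{5.76} $$
$$ \boldsymbol{e}(\boldsymbol{\theta}) := \boldsymbol{y} - \boldsymbol{\Phi} \boldsymbol{\theta} . \tag{5.77} $$

We seek $\frac{\partial L}{\partial \boldsymbol{\theta}}$, and we will use the
chain rule for this purpose. $L$ is called a least-squares loss function.

Before we start our calculation, we determine the dimensionality of the
gradient as

$$ \frac{\partial L}{\partial \boldsymbol{\theta}} \in \mathbb{R}^{1 \times D} . \tag{5.78} $$

The chain rule allows us to compute the gradient as

$$ \frac{\partial L}{\partial \boldsymbol{\theta}} = \frac{\partial L}{\partial \boldsymbol{e}} \frac{\partial \boldsymbol{e}}{\partial \boldsymbol{\theta}} , \tag{5.79} $$

where the $d$th element is given by

$$ \frac{\partial L}{\partial \boldsymbol{\theta}}[1, d] = \sum_{n=1}^{N} \frac{\partial L}{\partial \boldsymbol{e}}[n] \, \frac{\partial \boldsymbol{e}}{\partial \boldsymbol{\theta}}[n, d] . \tag{5.80} $$

We know that $\|\boldsymbol{e}\|^2 = \boldsymbol{e}^\top \boldsymbol{e}$ (see
Section 3.2) and determine

$$ \frac{\partial L}{\partial \boldsymbol{e}} = 2 \boldsymbol{e}^\top \in \mathbb{R}^{1 \times N} . \tag{5.81} $$

Furthermore, we obtain

$$ \frac{\partial \boldsymbol{e}}{\partial \boldsymbol{\theta}} = -\boldsymbol{\Phi} \in \mathbb{R}^{N \times D} , \tag{5.82} $$

such that our desired derivative is

$$ \frac{\partial L}{\partial \boldsymbol{\theta}} = -2 \boldsymbol{e}^\top \boldsymbol{\Phi} = - \underbrace{2 (\boldsymbol{y}^\top - \boldsymbol{\theta}^\top \boldsymbol{\Phi}^\top)}_{1 \times N} \underbrace{\boldsymbol{\Phi}}_{N \times D} \in \mathbb{R}^{1 \times D} . \tag{5.83} $$


Remark. We would have obtained the same result without using the chain
rule by immediately looking at the function

$$ L_2(\boldsymbol{\theta}) := \|\boldsymbol{y} - \boldsymbol{\Phi} \boldsymbol{\theta}\|^2 = (\boldsymbol{y} - \boldsymbol{\Phi} \boldsymbol{\theta})^\top (\boldsymbol{y} - \boldsymbol{\Phi} \boldsymbol{\theta}) . \tag{5.84} $$

This approach is still practical for simple functions like $L_2$ but becomes
impractical for deep function compositions. ♢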
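The chain-rule factors of Example 5.11 map directly onto array operations. The following NumPy sketch is an illustration under assumptions (random data, the sizes N = 6 and D = 3, and the variable names are made up): it forms $\partial L / \partial \boldsymbol{e}$ and $\partial \boldsymbol{e} / \partial \boldsymbol{\theta}$, multiplies them as in (5.79), and compares the result with a central-difference estimate.

```python
import numpy as np

rng = np.random.default_rng(0)
N, D = 6, 3                        # illustrative sizes (assumption)
Phi = rng.normal(size=(N, D))      # input features
y = rng.normal(size=N)             # observations
theta = rng.normal(size=D)         # parameter vector


def loss(theta):
    """Least-squares loss L = ||y - Phi theta||^2, cf. (5.76)-(5.77)."""
    e = y - Phi @ theta
    return e @ e


e = y - Phi @ theta
dL_de = 2 * e                      # (5.81), stored as a length-N array
de_dtheta = -Phi                   # (5.82), shape (N, D)
dL_dtheta = dL_de @ de_dtheta      # (5.83): -2 e^T Phi, length D

h = 1e-6
numeric = np.array([
    (loss(theta + h * np.eye(D)[d]) - loss(theta - h * np.eye(D)[d])) / (2 * h)
    for d in range(D)
])
print(dL_dtheta, numeric)
```

The product `dL_de @ de_dtheta` carries out exactly the sum over $n$ in (5.80).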
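Figure 5.7 Computing the gradient of a matrix
$\boldsymbol{A} \in \mathbb{R}^{4 \times 2}$ with respect to a vector
$\boldsymbol{x} \in \mathbb{R}^3$, which yields
$\frac{d\boldsymbol{A}}{d\boldsymbol{x}} \in \mathbb{R}^{4 \times 2 \times 3}$.

(a) Approach 1: We compute the partial derivative
$\frac{\partial \boldsymbol{A}}{\partial x_1}, \frac{\partial \boldsymbol{A}}{\partial x_2}, \frac{\partial \boldsymbol{A}}{\partial x_3}$,
each of which is a $4 \times 2$ matrix, and collate them in a
$4 \times 2 \times 3$ tensor.

(b) Approach 2: We re-shape (flatten)
$\boldsymbol{A} \in \mathbb{R}^{4 \times 2}$ into a vector
$\tilde{\boldsymbol{A}} \in \mathbb{R}^8$. Then, we compute the gradient
$\frac{d\tilde{\boldsymbol{A}}}{d\boldsymbol{x}} \in \mathbb{R}^{8 \times 3}$.
We obtain the gradient tensor by re-shaping this gradient as illustrated
above.
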




5.4 Gradients of Matrices 


We will encounter situations where we need to take gradients of matrices
with respect to vectors (or other matrices), which results in a
multidimensional tensor. We can think of this tensor as a multidimensional
array that collects partial derivatives. For example, if we compute the
gradient of an $m \times n$ matrix $\boldsymbol{A}$ with respect to a
$p \times q$ matrix $\boldsymbol{B}$, the resulting Jacobian would be
$(m \times n) \times (p \times q)$, i.e., a four-dimensional tensor
$\boldsymbol{J}$, whose entries are given as
$J_{ijkl} = \partial A_{ij} / \partial B_{kl}$.

Since matrices represent linear mappings, we can exploit the fact that
there is a vector-space isomorphism (linear, invertible mapping) between
the space $\mathbb{R}^{m \times n}$ of $m \times n$ matrices and the space
$\mathbb{R}^{mn}$ of $mn$ vectors. Therefore, we can re-shape our matrices into
vectors of lengths $mn$ and $pq$, respectively. The gradient using these $mn$
vectors results in a Jacobian of size $mn \times pq$. Figure 5.7 visualizes
both approaches. In practical applications, it is often desirable to re-shape
the matrix into a vector and continue working with this Jacobian matrix: The
chain rule (5.48) boils down to simple matrix multiplication, whereas in the
case of a Jacobian tensor, we will need to pay more attention to what
dimensions we need to sum out.
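The flatten-then-reshape route can be illustrated with NumPy. In the sketch below, the matrix-valued function `A_of_x` is a hypothetical example chosen only to have the $4 \times 2$ and $\mathbb{R}^3$ shapes of Figure 5.7, and the Jacobian is estimated by central differences; neither choice comes from the text. Row-major flattening is assumed, so that reshaping the $8 \times 3$ Jacobian to $4 \times 2 \times 3$ recovers entries $\partial A_{ij} / \partial x_k$.

```python
import numpy as np


def A_of_x(x):
    """Hypothetical 4 x 2 matrix-valued function of x in R^3."""
    x1, x2, x3 = x
    return np.array([[x1, x2],
                     [x3, x1 * x2],
                     [x2 * x3, x1 ** 2],
                     [np.sin(x3), x1 + x2 + x3]])


def jacobian_matrix(x, h=1e-6):
    """Flatten A into R^8 and build the 8 x 3 Jacobian column by column."""
    cols = []
    for k in range(3):
        step = np.zeros(3)
        step[k] = h
        diff = (A_of_x(x + step) - A_of_x(x - step)) / (2 * h)
        cols.append(diff.reshape(-1))      # row-major flatten of dA/dx_k
    return np.stack(cols, axis=1)          # shape (8, 3)


x = np.array([0.5, -1.0, 2.0])
J_flat = jacobian_matrix(x)                # flattened Jacobian, 8 x 3
J_tensor = J_flat.reshape(4, 2, 3)         # Jacobian tensor, 4 x 2 x 3
print(J_flat.shape, J_tensor.shape)
```

Here `J_tensor[i, j, k]` approximates $\partial A_{ij} / \partial x_k$, and each slice `J_tensor[:, :, k]` is one of the $4 \times 2$ partial-derivative matrices that Approach 1 would collate.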


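Example 5.12 (Gradient of Vectors with Respect to Matrices)
Let us consider the following example, where

$$ \boldsymbol{f} = \boldsymbol{A} \boldsymbol{x} , \quad \boldsymbol{f} \in \mathbb{R}^M , \quad \boldsymbol{A} \in \mathbb{R}^{M \times N} , \quad \boldsymbol{x} \in \mathbb{R}^N \tag{5.85} $$

and where we seek the gradient $d\boldsymbol{f} / d\boldsymbol{A}$. Let us
start again by determining the dimension of the gradient as

$$ \frac{d\boldsymbol{f}}{d\boldsymbol{A}} \in \mathbb{R}^{M \times (M \times N)} . \tag{5.86} $$

By definition, the gradient is the collection of the partial derivatives:

$$ \frac{d\boldsymbol{f}}{d\boldsymbol{A}} = \begin{bmatrix} \frac{\partial f_1}{\partial \boldsymbol{A}} \\ \vdots \\ \frac{\partial f_M}{\partial \boldsymbol{A}} \end{bmatrix} , \quad \frac{\partial f_i}{\partial \boldsymbol{A}} \in \mathbb{R}^{1 \times (M \times N)} . \tag{5.87} $$

To compute the partial derivatives, it will be helpful to explicitly write out
the matrix vector multiplication:

$$ f_i = \sum_{j=1}^{N} A_{ij} x_j , \quad i = 1, \ldots, M , \tag{5.88} $$

and the partial derivatives are then given as

$$ \frac{\partial f_i}{\partial A_{iq}} = x_q . \tag{5.89} $$

This allows us to compute the partial derivatives of $f_i$ with respect to a
row of $\boldsymbol{A}$, which is given as

$$ \frac{\partial f_i}{\partial A_{i,:}} = \boldsymbol{x}^\top \in \mathbb{R}^{1 \times 1 \times N} , \tag{5.90} $$
$$ \frac{\partial f_i}{\partial A_{k \neq i,:}} = \boldsymbol{0}^\top \in \mathbb{R}^{1 \times 1 \times N} , \tag{5.91} $$

where we have to pay attention to the correct dimensionality. Since $f_i$
maps onto $\mathbb{R}$ and each row of $\boldsymbol{A}$ is of size
$1 \times N$, we obtain a $1 \times 1 \times N$-sized tensor as the partial
derivative of $f_i$ with respect to a row of $\boldsymbol{A}$.

We stack the partial derivatives (5.91) and get the desired gradient
in (5.87) via

$$ \frac{\partial f_i}{\partial \boldsymbol{A}} = \begin{bmatrix} \boldsymbol{0}^\top \\ \vdots \\ \boldsymbol{0}^\top \\ \boldsymbol{x}^\top \\ \boldsymbol{0}^\top \\ \vdots \\ \boldsymbol{0}^\top \end{bmatrix} \in \mathbb{R}^{1 \times (M \times N)} . \tag{5.92} $$
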
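Example 5.13 (Gradient of Matrices with Respect to Matrices)
Consider a matrix $\boldsymbol{R} \in \mathbb{R}^{M \times N}$ and
$\boldsymbol{f} : \mathbb{R}^{M \times N} \to \mathbb{R}^{N \times N}$ with

$$ \boldsymbol{f}(\boldsymbol{R}) = \boldsymbol{R}^\top \boldsymbol{R} =: \boldsymbol{K} \in \mathbb{R}^{N \times N} , \tag{5.93} $$

where we seek the gradient $d\boldsymbol{K} / d\boldsymbol{R}$.
To solve this hard problem, let us first write down what we already
know: The gradient has the dimensions

$$ \frac{d\boldsymbol{K}}{d\boldsymbol{R}} \in \mathbb{R}^{(N \times N) \times (M \times N)} , \tag{5.94} $$

which is a tensor. Moreover,

$$ \frac{dK_{pq}}{d\boldsymbol{R}} \in \mathbb{R}^{1 \times M \times N} \tag{5.95} $$

for $p, q = 1, \ldots, N$, where $K_{pq}$ is the $(p, q)$th entry of
$\boldsymbol{K} = \boldsymbol{f}(\boldsymbol{R})$. Denoting the $i$th column of
$\boldsymbol{R}$ by $\boldsymbol{r}_i$, every entry of $\boldsymbol{K}$ is given
by the dot product of two columns of $\boldsymbol{R}$, i.e.,

$$ K_{pq} = \boldsymbol{r}_p^\top \boldsymbol{r}_q = \sum_{m=1}^{M} R_{mp} R_{mq} . \tag{5.96} $$

When we now compute the partial derivative
$\frac{\partial K_{pq}}{\partial R_{ij}}$ we obtain

$$ \frac{\partial K_{pq}}{\partial R_{ij}} = \sum_{m=1}^{M} \frac{\partial}{\partial R_{ij}} R_{mp} R_{mq} = \partial_{pqij} , \tag{5.97} $$

$$ \partial_{pqij} = \begin{cases} R_{iq} & \text{if } j = p, \ p \neq q \\ R_{ip} & \text{if } j = q, \ p \neq q \\ 2 R_{iq} & \text{if } j = p, \ p = q \\ 0 & \text{otherwise} \end{cases} . \tag{5.98} $$

From (5.94), we know that the desired gradient has the dimension
$(N \times N) \times (M \times N)$, and every single entry of this tensor is
given by $\partial_{pqij}$ in (5.98), where $p, q, j = 1, \ldots, N$ and
$i = 1, \ldots, M$.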
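The case distinction in (5.98) is easy to get wrong by hand, so a numerical cross-check is useful. The sketch below is illustrative: the helper `dK_dR_entry`, the sizes M = 3 and N = 4, and the random test matrix are assumptions. It fills the four-dimensional tensor from (5.98) and compares it with central differences of $\boldsymbol{R}^\top \boldsymbol{R}$.

```python
import numpy as np


def dK_dR_entry(R, p, q, i, j):
    """Entry d K_pq / d R_ij for K = R^T R, following the cases in (5.98)."""
    if p != q:
        if j == p:
            return R[i, q]
        if j == q:
            return R[i, p]
        return 0.0
    return 2 * R[i, q] if j == p else 0.0


rng = np.random.default_rng(1)
M, N = 3, 4                              # illustrative sizes (assumption)
R = rng.normal(size=(M, N))

# Analytic tensor of shape (N, N, M, N), indexed as [p, q, i, j]
analytic = np.zeros((N, N, M, N))
for p in range(N):
    for q in range(N):
        for i in range(M):
            for j in range(N):
                analytic[p, q, i, j] = dK_dR_entry(R, p, q, i, j)

# Central differences: perturb one entry R_ij at a time
h = 1e-6
numeric = np.zeros_like(analytic)
for i in range(M):
    for j in range(N):
        E = np.zeros((M, N))
        E[i, j] = h
        numeric[:, :, i, j] = ((R + E).T @ (R + E) - (R - E).T @ (R - E)) / (2 * h)

print(np.max(np.abs(analytic - numeric)))
```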


5.5 Useful Identities for Computing Gradients 


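In the following, we list some useful gradients that are frequently required
in a machine learning context (Petersen and Pedersen, 2012). Here, we
use $\operatorname{tr}(\cdot)$ as the trace (see Definition 4.4),
$\det(\cdot)$ as the determinant (see Section 4.1) and
$\boldsymbol{f}(\boldsymbol{X})^{-1}$ as the inverse of
$\boldsymbol{f}(\boldsymbol{X})$, assuming it exists.

$$ \frac{\partial}{\partial \boldsymbol{X}} \boldsymbol{f}(\boldsymbol{X})^\top = \left( \frac{\partial \boldsymbol{f}(\boldsymbol{X})}{\partial \boldsymbol{X}} \right)^\top \tag{5.99} $$
$$ \frac{\partial}{\partial \boldsymbol{X}} \operatorname{tr}(\boldsymbol{f}(\boldsymbol{X})) = \operatorname{tr}\left( \frac{\partial \boldsymbol{f}(\boldsymbol{X})}{\partial \boldsymbol{X}} \right) \tag{5.100} $$
$$ \frac{\partial}{\partial \boldsymbol{X}} \det(\boldsymbol{f}(\boldsymbol{X})) = \det(\boldsymbol{f}(\boldsymbol{X})) \operatorname{tr}\left( \boldsymbol{f}(\boldsymbol{X})^{-1} \frac{\partial \boldsymbol{f}(\boldsymbol{X})}{\partial \boldsymbol{X}} \right) \tag{5.101} $$
$$ \frac{\partial}{\partial \boldsymbol{X}} \boldsymbol{f}(\boldsymbol{X})^{-1} = -\boldsymbol{f}(\boldsymbol{X})^{-1} \frac{\partial \boldsymbol{f}(\boldsymbol{X})}{\partial \boldsymbol{X}} \boldsymbol{f}(\boldsymbol{X})^{-1} \tag{5.102} $$
$$ \frac{\partial \boldsymbol{a}^\top \boldsymbol{X}^{-1} \boldsymbol{b}}{\partial \boldsymbol{X}} = -(\boldsymbol{X}^{-1})^\top \boldsymbol{a} \boldsymbol{b}^\top (\boldsymbol{X}^{-1})^\top \tag{5.103} $$
$$ \frac{\partial \boldsymbol{x}^\top \boldsymbol{a}}{\partial \boldsymbol{x}} = \boldsymbol{a}^\top \tag{5.104} $$
$$ \frac{\partial \boldsymbol{a}^\top \boldsymbol{x}}{\partial \boldsymbol{x}} = \boldsymbol{a}^\top \tag{5.105} $$
$$ \frac{\partial \boldsymbol{a}^\top \boldsymbol{X} \boldsymbol{b}}{\partial \boldsymbol{X}} = \boldsymbol{a} \boldsymbol{b}^\top \tag{5.106} $$
$$ \frac{\partial \boldsymbol{x}^\top \boldsymbol{B} \boldsymbol{x}}{\partial \boldsymbol{x}} = \boldsymbol{x}^\top (\boldsymbol{B} + \boldsymbol{B}^\top) \tag{5.107} $$
$$ \frac{\partial}{\partial \boldsymbol{s}} (\boldsymbol{x} - \boldsymbol{A} \boldsymbol{s})^\top \boldsymbol{W} (\boldsymbol{x} - \boldsymbol{A} \boldsymbol{s}) = -2 (\boldsymbol{x} - \boldsymbol{A} \boldsymbol{s})^\top \boldsymbol{W} \boldsymbol{A} \quad \text{for symmetric } \boldsymbol{W} \tag{5.108} $$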
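Identities such as (5.107) and (5.108) can be spot-checked numerically before they are relied on in an implementation. The following sketch is illustrative only: the helper `numerical_gradient`, the dimensions, and the random test data are assumptions, and a central difference stands in for the analytic derivative.

```python
import numpy as np


def numerical_gradient(f, x, h=1e-6):
    """Central-difference gradient of a scalar-valued function of a vector."""
    grad = np.zeros_like(x)
    for i in range(x.size):
        step = np.zeros_like(x)
        step[i] = h
        grad[i] = (f(x + step) - f(x - step)) / (2 * h)
    return grad


rng = np.random.default_rng(2)
n, m = 4, 3                        # illustrative sizes (assumption)
B = rng.normal(size=(n, n))
x = rng.normal(size=n)
A = rng.normal(size=(n, m))
s = rng.normal(size=m)
W0 = rng.normal(size=(n, n))
W = W0 + W0.T                      # symmetric, as (5.108) requires

# (5.107): d/dx x^T B x = x^T (B + B^T)
lhs_107 = numerical_gradient(lambda v: v @ B @ v, x)
rhs_107 = x @ (B + B.T)

# (5.108): d/ds (x - A s)^T W (x - A s) = -2 (x - A s)^T W A
lhs_108 = numerical_gradient(lambda v: (x - A @ v) @ W @ (x - A @ v), s)
rhs_108 = -2 * (x - A @ s) @ W @ A

print(np.max(np.abs(lhs_107 - rhs_107)), np.max(np.abs(lhs_108 - rhs_108)))
```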


Remark. In this book, we only cover traces and transposes of matrices.
However, we have seen that derivatives can be higher-dimensional
tensors, in which case the usual trace and transpose are not defined. In these
cases, the trace of a $D \times D \times E \times F$ tensor would be an
$E \times F$-dimensional matrix. This is a special case of a tensor
contraction. Similarly, when we “transpose” a tensor, we mean swapping the
first two dimensions. Specifically, in (5.99) through (5.102), we require
tensor-related computations when we work with multivariate functions
$\boldsymbol{f}(\cdot)$ and compute derivatives with respect to matrices (and
choose not to vectorize them as discussed in Section 5.4). ♢


5.6 Backpropagation and Automatic Differentiation 


In many machine learning applications, we find good model parameters 
by performing gradient descent (Section 7.1), which relies on the fact 
that we can compute the gradient of a learning objective with respect 
to the parameters of the model. For a given objective function, we can 
obtain the gradient with respect to the model parameters using calculus 
and applying the chain rule; see Section 5.2.2. We already had a taste in 
Section 5.3 when we looked at the gradient of a squared loss with respect 
to the parameters of a linear regression model. 
Consider the function

$$ f(x) = \sqrt{x^2 + \exp(x^2)} + \cos\big(x^2 + \exp(x^2)\big) . \tag{5.109} $$

By application of the chain rule, and noting that differentiation is linear,
we compute the gradient

$$ \frac{df}{dx} = \frac{2x + 2x \exp(x^2)}{2 \sqrt{x^2 + \exp(x^2)}} - \sin\big(x^2 + \exp(x^2)\big) \big(2x + 2x \exp(x^2)\big) $$
$$ = 2x \left( \frac{1}{2 \sqrt{x^2 + \exp(x^2)}} - \sin\big(x^2 + \exp(x^2)\big) \right) \big(1 + \exp(x^2)\big) . \tag{5.110} $$

Writing out the gradient in this explicit way is often impractical because
it tends to produce a very lengthy expression for the derivative. In practice,
it means that, if we are not careful, the implementation of the gradient
could be significantly more expensive than computing the function, which
imposes unnecessary overhead. For training deep neural network models,
the backpropagation algorithm (Kelley, 1960; Bryson, 1961; Dreyfus,
1962; Rumelhart et al., 1986) is an efficient way to compute the gradient
of an error function with respect to the parameters of the model.
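The point about overhead can be seen in a small sketch. In the code below (an illustration; the intermediate names `a`, `b`, `c` and the test point are assumptions), the repeated subexpressions of (5.109) are computed once and reused in both the function and its derivative (5.110), and the result is compared with a central-difference estimate.

```python
import math


def f(x):
    """Function (5.109), written with shared intermediate values."""
    a = x ** 2
    b = math.exp(a)
    c = a + b
    return math.sqrt(c) + math.cos(c)


def df_dx(x):
    """Gradient (5.110), reusing the same intermediate values."""
    a = x ** 2
    b = math.exp(a)
    c = a + b
    dc_dx = 2 * x + 2 * x * b          # derivative of x^2 + exp(x^2)
    return dc_dx / (2 * math.sqrt(c)) - math.sin(c) * dc_dx


x = 0.8       # illustrative evaluation point (assumption)
h = 1e-6
numeric = (f(x + h) - f(x - h)) / (2 * h)
analytic = df_dx(x)
print(numeric, analytic)
```

Naming intermediate quantities in this way is the same idea that backpropagation and automatic differentiation apply systematically.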


5.6.1 Gradients in a Deep Network 


An area where the chain rule is used to an extreme is deep learning, where
the function value $\boldsymbol{y}$ is computed as a many-level function
composition

$$ \boldsymbol{y} = (f_K \circ f_{K-1} \circ \cdots \circ f_1)(\boldsymbol{x}) = f_K(f_{K-1}(\cdots(f_1(\boldsymbol{x}))\cdots)) , \tag{5.111} $$

where $\boldsymbol{x}$ are the inputs (e.g., images), $\boldsymbol{y}$ are the
observations (e.g., class labels), and every function $f_i$,
$i = 1, \ldots, K$, possesses its own parameters.
A good discussion about backpropagation and the chain rule is available at a
blog by Tim Viera at https://tinyurl.com/ycfm2yrw.

Figure 5.8 Forward pass in a multi-layer neural network to compute the loss
$L$ as a function of the inputs $\boldsymbol{x}$ and the parameters
$\boldsymbol{A}_i, \boldsymbol{b}_i$.

We discuss the case, where the activation functions are identical in each
layer to unclutter notation.

A more in-depth discussion about gradients of neural networks can be found
in Justin Domke’s lecture notes https://tinyurl.com/yalcxgtv.



In neural networks with multiple layers, we have functions f_i(x_{i−1}) =
σ(A_{i−1} x_{i−1} + b_{i−1}) in the ith layer. Here x_{i−1} is the output of layer i − 1
and σ an activation function, such as the logistic sigmoid 1/(1 + e^{−x}), tanh, or a
rectified linear unit (ReLU). In order to train these models, we require the
gradient of a loss function L with respect to all model parameters A_j, b_j
for j = 0, ..., K − 1. This also requires us to compute the gradient of L with
respect to the inputs of each layer. For example, if we have inputs x and


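observations y and a network structure defined by

$$f_0 := x$$ (5.112)
$$f_i := \sigma_i(A_{i-1} f_{i-1} + b_{i-1}), \quad i = 1, \dots, K,$$ (5.113)

see also Figure 5.8 for a visualization, we may be interested in finding
A_j, b_j for j = 0, ..., K − 1, such that the squared loss

$$L(\theta) = \| y - f_K(\theta, x) \|^2$$ (5.114)

is minimized, where θ = {A_0, b_0, ..., A_{K−1}, b_{K−1}}.
To obtain the gradients with respect to the parameter set θ, we require
the partial derivatives of L with respect to the parameters θ_j = {A_j, b_j}
of each layer j = 0, ..., K − 1. The chain rule allows us to determine the
partial derivatives as

$$\frac{\partial L}{\partial \theta_{K-1}} = \frac{\partial L}{\partial f_K} \frac{\partial f_K}{\partial \theta_{K-1}}$$ (5.115)
$$\frac{\partial L}{\partial \theta_{K-2}} = \frac{\partial L}{\partial f_K} \boxed{\frac{\partial f_K}{\partial f_{K-1}} \frac{\partial f_{K-1}}{\partial \theta_{K-2}}}$$ (5.116)
$$\frac{\partial L}{\partial \theta_{K-3}} = \frac{\partial L}{\partial f_K} \frac{\partial f_K}{\partial f_{K-1}} \boxed{\frac{\partial f_{K-1}}{\partial f_{K-2}} \frac{\partial f_{K-2}}{\partial \theta_{K-3}}}$$ (5.117)
$$\frac{\partial L}{\partial \theta_i} = \frac{\partial L}{\partial f_K} \frac{\partial f_K}{\partial f_{K-1}} \cdots \boxed{\frac{\partial f_{i+2}}{\partial f_{i+1}} \frac{\partial f_{i+1}}{\partial \theta_i}}$$ (5.118)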











The orange terms (those of the form ∂f_{j+1}/∂f_j) are partial derivatives of
the output of a layer with respect to its inputs, whereas the blue terms
(those of the form ∂f_{j+1}/∂θ_j) are partial derivatives of the output of a
layer with respect to its parameters. Assuming we have already computed the
partial derivatives ∂L/∂θ_{i+1}, most of the computation can be reused to
compute ∂L/∂θ_i. The additional terms that we need to compute are indicated
by the boxes. Figure 5.9 visualizes that the
gradients are passed backward through the network.
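
The reuse of upstream gradients can be sketched in a few lines of Python. This is a minimal illustration, not the book's implementation: it assumes scalar layers f_i = tanh(a_{i−1} f_{i−1} + b_{i−1}), and the parameter values, the input, and the target are hypothetical. The variable `upstream` holds ∂L/∂f_{j+1} and is updated once per layer, so no part of the chain is recomputed.

```python
import math

def forward(params, x):
    """Return [f_0, ..., f_K] for scalar layers f_i = tanh(a_{i-1} * f_{i-1} + b_{i-1})."""
    fs = [x]
    for a, b in params:
        fs.append(math.tanh(a * fs[-1] + b))
    return fs

def loss(params, x, y):
    """Squared loss between the target y and the network output f_K."""
    return (y - forward(params, x)[-1]) ** 2

def backward(params, x, y):
    """Return [(dL/da_j, dL/db_j)] for j = 0..K-1 using one backward sweep."""
    fs = forward(params, x)
    upstream = -2.0 * (y - fs[-1])            # dL/df_K
    grads = [None] * len(params)
    for j in reversed(range(len(params))):
        a, _ = params[j]
        local = 1.0 - fs[j + 1] ** 2          # derivative of tanh at layer j+1
        grads[j] = (upstream * local * fs[j], upstream * local)
        upstream = upstream * local * a       # dL/df_j, reused by the layer below
    return grads

# Hypothetical parameters (a_j, b_j), input, and target.
params = [(0.5, -0.1), (1.2, 0.3), (-0.7, 0.2)]
x, y = 0.8, 0.25
grads = backward(params, x, y)

def finite_difference(j, k, eps=1e-6):
    """Central difference of the loss with respect to entry k of layer j."""
    plus = [list(p) for p in params]
    minus = [list(p) for p in params]
    plus[j][k] += eps
    minus[j][k] -= eps
    return (loss(plus, x, y) - loss(minus, x, y)) / (2 * eps)

max_err = max(abs(grads[j][k] - finite_difference(j, k))
              for j in range(len(params)) for k in range(2))
```

Comparing against central finite differences is only a sanity check on the sweep; `max_err` should be close to zero.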




5.6.2 Automatic Differentiation 


It turns out that backpropagation is a special case of a general technique
in numerical analysis called automatic differentiation. We can think of
automatic differentiation as a set of techniques to numerically (in contrast to
symbolically) evaluate the exact (up to machine precision) gradient of a
function by working with intermediate variables and applying the chain
rule. Automatic differentiation applies a series of elementary arithmetic
operations, e.g., addition and multiplication, and elementary functions,
e.g., sin, cos, exp, log. By applying the chain rule to these operations, the
gradient of quite complicated functions can be computed automatically.
Automatic differentiation applies to general computer programs and has
forward and reverse modes. Baydin et al. (2018) give a great overview of
automatic differentiation in machine learning.

Figure 5.10 shows a simple graph representing the data flow from inputs
x to outputs y via some intermediate variables a, b. If we were to
compute the derivative dy/dx, we would apply the chain rule and obtain

$$\frac{dy}{dx} = \frac{dy}{db} \frac{db}{da} \frac{da}{dx}.$$ (5.119)

Intuitively, the forward and reverse mode differ in the order of multiplication.
Due to the associativity of matrix multiplication, we can choose between

$$\frac{dy}{dx} = \left( \frac{dy}{db} \frac{db}{da} \right) \frac{da}{dx},$$ (5.120)
$$\frac{dy}{dx} = \frac{dy}{db} \left( \frac{db}{da} \frac{da}{dx} \right).$$ (5.121)


Equation (5.120) would be the reverse mode because gradients are propagated
backward through the graph, i.e., reverse to the data flow. Equation (5.121)
would be the forward mode, where the gradients flow with
the data from left to right through the graph.

[Figure 5.9: Backward pass in a multi-layer neural network to compute the gradients of the loss function.]

[Figure 5.10: Simple graph illustrating the flow of data from x to y via some intermediate variables a, b.]

[Margin note: Automatic differentiation is different from symbolic differentiation and numerical approximations of the gradient, e.g., by using finite differences.]

[Margin note: In the general case, we work with Jacobians, which can be vectors, matrices, or tensors.]

[Figure 5.11: Computation graph with inputs x, function values f, and intermediate variables a, b, c, d, e.]

In the following, we will focus on reverse mode automatic differentiation,
which is backpropagation. In the context of neural networks, where
the input dimensionality is often much higher than the dimensionality of
the labels, the reverse mode is computationally significantly cheaper than
the forward mode. Let us start with an instructive example.
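
Before the worked example, the cost claim can be made concrete with a small counting experiment. The sketch below assumes a scalar output y and n-dimensional x, a, b, so that dy/db is a 1 × n Jacobian and db/da, da/dx are n × n; the matrix entries are arbitrary placeholders and the helper `matmul` is written only for this illustration. It multiplies the three Jacobians in the two orders of (5.120) and (5.121) and counts scalar multiplications.

```python
def matmul(A, B):
    """Multiply A (p x q) by B (q x r); also return the number of scalar multiplications."""
    p, q, r = len(A), len(B), len(B[0])
    C = [[sum(A[i][k] * B[k][j] for k in range(q)) for j in range(r)]
         for i in range(p)]
    return C, p * q * r

n = 50
# Placeholder Jacobians with arbitrary entries.
dy_db = [[(i % 7) - 3.0 for i in range(n)]]                               # 1 x n
db_da = [[((i + 2 * j) % 5) - 2.0 for j in range(n)] for i in range(n)]   # n x n
da_dx = [[((3 * i + j) % 4) - 1.5 for j in range(n)] for i in range(n)]   # n x n

# Reverse mode, as in (5.120): (dy/db * db/da) * da/dx
tmp, c1 = matmul(dy_db, db_da)
rev, c2 = matmul(tmp, da_dx)
reverse_cost = c1 + c2

# Forward mode, as in (5.121): dy/db * (db/da * da/dx)
tmp, c3 = matmul(db_da, da_dx)
fwd, c4 = matmul(dy_db, tmp)
forward_cost = c3 + c4
```

Both orders give the same 1 × n result, but under these assumptions the reverse order needs 2n² multiplications while the forward order needs n³ + n².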


Example 5.14 
Consider the function 


$$f(x) = \sqrt{x^2 + \exp(x^2)} + \cos\left(x^2 + \exp(x^2)\right)$$ (5.122)


from (5.109). If we were to implement a function f on a computer, we 
would be able to save some computation by using intermediate variables: 


$$a = x^2,$$ (5.123)
$$b = \exp(a),$$ (5.124)
$$c = a + b,$$ (5.125)
$$d = \sqrt{c},$$ (5.126)
$$e = \cos(c),$$ (5.127)
$$f = d + e.$$ (5.128)





This is the same kind of thinking process that occurs when applying 
the chain rule. Note that the preceding set of equations requires fewer 
operations than a direct implementation of the function f(x) as defined 
in (5.109). The corresponding computation graph in Figure 5.11 shows 
the flow of data and computations required to obtain the function value
f.
The set of equations that include intermediate variables can be thought 
of as a computation graph, a representation that is widely used in imple- 
mentations of neural network software libraries. We can directly compute 
the derivatives of the intermediate variables with respect to their corre- 
sponding inputs by recalling the definition of the derivative of elementary 
functions. We obtain the following: 


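$$\frac{\partial a}{\partial x} = 2x$$ (5.129)
$$\frac{\partial b}{\partial a} = \exp(a)$$ (5.130)
$$\frac{\partial c}{\partial a} = 1 = \frac{\partial c}{\partial b}$$ (5.131)
$$\frac{\partial d}{\partial c} = \frac{1}{2\sqrt{c}}$$ (5.132)
$$\frac{\partial e}{\partial c} = -\sin(c)$$ (5.133)
$$\frac{\partial f}{\partial d} = 1 = \frac{\partial f}{\partial e}.$$ (5.134)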


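By looking at the computation graph in Figure 5.11, we can compute
∂f/∂x by working backward from the output and obtain

$$\frac{\partial f}{\partial c} = \frac{\partial f}{\partial d} \frac{\partial d}{\partial c} + \frac{\partial f}{\partial e} \frac{\partial e}{\partial c}$$ (5.135)
$$\frac{\partial f}{\partial b} = \frac{\partial f}{\partial c} \frac{\partial c}{\partial b}$$ (5.136)
$$\frac{\partial f}{\partial a} = \frac{\partial f}{\partial b} \frac{\partial b}{\partial a} + \frac{\partial f}{\partial c} \frac{\partial c}{\partial a}$$ (5.137)
$$\frac{\partial f}{\partial x} = \frac{\partial f}{\partial a} \frac{\partial a}{\partial x}.$$ (5.138)

Note that we implicitly applied the chain rule to obtain ∂f/∂x. By substituting
the results of the derivatives of the elementary functions, we get

$$\frac{\partial f}{\partial c} = 1 \cdot \frac{1}{2\sqrt{c}} + 1 \cdot (-\sin(c))$$ (5.139)
$$\frac{\partial f}{\partial b} = \frac{\partial f}{\partial c} \cdot 1$$ (5.140)
$$\frac{\partial f}{\partial a} = \frac{\partial f}{\partial b} \exp(a) + \frac{\partial f}{\partial c} \cdot 1$$ (5.141)
$$\frac{\partial f}{\partial x} = \frac{\partial f}{\partial a} \cdot 2x.$$ (5.142)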


By thinking of each of the derivatives above as a variable, we observe 
that the computation required for calculating the derivative is of similar 
complexity as the computation of the function itself. This is quite counter- 
intuitive since the mathematical expression for the derivative of (5.110) 
is significantly more complicated than the mathematical expression of the 
function f(x) in (5.109). 
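
The example translates almost line by line into code. The following sketch evaluates the intermediate variables (5.123)-(5.128) in a forward pass and the derivatives (5.139)-(5.142) in a backward pass; the evaluation point x0 = 0.7 and the finite-difference comparison are additions made here for checking, not part of the example.

```python
import math

def f_and_grad(x):
    """Return f(x) and df/dx for the function in (5.122)."""
    # Forward pass: intermediate variables (5.123)-(5.128).
    a = x ** 2
    b = math.exp(a)
    c = a + b
    d = math.sqrt(c)
    e = math.cos(c)
    f = d + e
    # Backward pass: derivatives (5.139)-(5.142), output to input.
    df_dc = 1.0 / (2.0 * math.sqrt(c)) - math.sin(c)
    df_db = df_dc * 1.0
    df_da = df_db * math.exp(a) + df_dc * 1.0
    df_dx = df_da * 2.0 * x
    return f, df_dx

x0 = 0.7                      # arbitrary evaluation point
f0, g0 = f_and_grad(x0)

# Central finite difference as an independent check of the derivative.
h = 1e-6
g_fd = (f_and_grad(x0 + h)[0] - f_and_grad(x0 - h)[0]) / (2 * h)
```

The backward pass uses four short assignments, which mirrors the observation that the derivative costs about as much as the function itself.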


Automatic differentiation is a formalization of Example 5.14. Let x_1, ..., x_d
be the input variables to the function, x_{d+1}, ..., x_{D−1} be the intermediate
variables, and x_D the output variable. Then the computation graph can be
expressed as follows:

$$\text{For } i = d+1, \dots, D: \quad x_i = g_i(x_{\mathrm{Pa}(x_i)}),$$ (5.143)

where the g_i(·) are elementary functions and x_{Pa(x_i)} are the parent nodes
of the variable x_i in the graph. Given a function defined in this way, we
can use the chain rule to compute the derivative of the function in a
step-by-step fashion. Recall that by definition f = x_D and hence

$$\frac{\partial f}{\partial x_D} = 1.$$ (5.144)

For other variables x_i, we apply the chain rule

$$\frac{\partial f}{\partial x_i} = \sum_{x_j : x_i \in \mathrm{Pa}(x_j)} \frac{\partial f}{\partial x_j} \frac{\partial x_j}{\partial x_i} = \sum_{x_j : x_i \in \mathrm{Pa}(x_j)} \frac{\partial f}{\partial x_j} \frac{\partial g_j}{\partial x_i},$$ (5.145)

where Pa(x_j) is the set of parent nodes of x_j in the computation graph.
Equation (5.143) is the forward propagation of a function, whereas (5.145)
is the backpropagation of the gradient through the computation graph.
For neural network training, we backpropagate the error of the prediction
with respect to the label.

[Margin note: Auto-differentiation in reverse mode requires a parse tree.]
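
The forward rule (5.143) and the backward rule (5.145) can be sketched as a tiny reverse-mode engine. Everything here is illustrative: the class `Var`, the global `tape`, and the helper names are assumptions made for this sketch and do not correspond to any particular library. Each node stores its parents together with the local partial derivative ∂g_i/∂x_parent, and the backward sweep visits nodes in reverse creation order.

```python
import math

class Var:
    """Node x_i of a computation graph with its value, parents, and local partials."""
    tape = []   # nodes in creation order, which is a valid topological order

    def __init__(self, value, parents=()):
        self.value = value
        self.parents = parents      # tuple of (parent_node, d g_i / d parent)
        self.grad = 0.0
        Var.tape.append(self)

    def __add__(self, other):
        return Var(self.value + other.value, ((self, 1.0), (other, 1.0)))

    def __mul__(self, other):
        return Var(self.value * other.value,
                   ((self, other.value), (other, self.value)))

def exp(u):
    return Var(math.exp(u.value), ((u, math.exp(u.value)),))

def cos(u):
    return Var(math.cos(u.value), ((u, -math.sin(u.value)),))

def sqrt(u):
    return Var(math.sqrt(u.value), ((u, 0.5 / math.sqrt(u.value)),))

def backprop(output):
    """Seed df/dx_D = 1 as in (5.144), then accumulate (5.145) in reverse order."""
    for node in Var.tape:
        node.grad = 0.0
    output.grad = 1.0
    for node in reversed(Var.tape):
        for parent, local in node.parents:
            parent.grad += node.grad * local

# The function of Example 5.14, built from elementary operations.
x = Var(0.7)
a = x * x
b = exp(a)
c = a + b
f = sqrt(c) + cos(c)
backprop(f)      # x.grad now holds df/dx at x = 0.7
```

Because every child of a node is created after the node itself, a node's gradient is complete by the time the reverse sweep reaches it.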

The automatic differentiation approach above works whenever we have
a function that can be expressed as a computation graph, where the
elementary functions are differentiable. In fact, the function may not even be
a mathematical function but a computer program. However, not all computer
programs can be automatically differentiated, e.g., if we cannot find
differentiable elementary functions. Programming structures, such as for
loops and if statements, require more care as well.


5.7 Higher-Order Derivatives 


So far, we have discussed gradients, i.e., first-order derivatives. Some- 
times, we are interested in derivatives of higher order, e.g., when we want 
to use Newton’s Method for optimization, which requires second-order 
derivatives (Nocedal and Wright, 2006). In Section 5.1.1, we discussed 
the Taylor series to approximate functions using polynomials. In the mul- 
tivariate case, we can do exactly the same. In the following, we will do 
exactly this. But let us start with some notation. 

Consider a function f : ℝ^2 → ℝ of two variables x, y. We use the
following notation for higher-order partial derivatives (and for gradients):

- ∂²f/∂x² is the second partial derivative of f with respect to x.
- ∂ⁿf/∂xⁿ is the nth partial derivative of f with respect to x.
- ∂²f/∂y∂x = ∂/∂y (∂f/∂x) is the partial derivative obtained by first partial differentiating with respect to x and then with respect to y.
- ∂²f/∂x∂y is the partial derivative obtained by first partial differentiating by y and then x.

The Hessian is the collection of all second-order partial derivatives.

If f(x, y) is a twice (continuously) differentiable function, then

$$\frac{\partial^2 f}{\partial x \partial y} = \frac{\partial^2 f}{\partial y \partial x},$$ (5.146)

i.e., the order of differentiation does not matter, and the corresponding
Hessian matrix

$$H = \begin{bmatrix} \frac{\partial^2 f}{\partial x^2} & \frac{\partial^2 f}{\partial x \partial y} \\ \frac{\partial^2 f}{\partial x \partial y} & \frac{\partial^2 f}{\partial y^2} \end{bmatrix}$$ (5.147)

is symmetric. The Hessian is denoted as ∇²_{x,y} f(x, y). Generally, for x ∈ ℝ^n
and f : ℝ^n → ℝ, the Hessian is an n × n matrix. The Hessian measures
the curvature of the function locally around (x, y).
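
The symmetry of the Hessian can be observed numerically. The sketch below is an assumption-laden illustration: the test function sin(x)·exp(y), the evaluation point, and the step size h are chosen here for convenience, and the Hessian is approximated with central finite differences rather than computed exactly.

```python
import math

def f(p):
    """Hypothetical smooth test function of two variables."""
    x, y = p
    return math.sin(x) * math.exp(y)

def hessian_fd(f, p, h=1e-4):
    """Approximate the Hessian of f at p with central finite differences."""
    n = len(p)
    H = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            def shifted(si, sj):
                q = list(p)
                q[i] += si * h
                q[j] += sj * h
                return f(q)
            H[i][j] = (shifted(1, 1) - shifted(1, -1)
                       - shifted(-1, 1) + shifted(-1, -1)) / (4 * h * h)
    return H

p = [0.3, -0.5]
H = hessian_fd(f, p)

# Analytic second-order partial derivatives of the test function.
x, y = p
H_exact = [[-math.sin(x) * math.exp(y), math.cos(x) * math.exp(y)],
           [math.cos(x) * math.exp(y), math.sin(x) * math.exp(y)]]
```

The off-diagonal entries `H[0][1]` and `H[1][0]` agree up to rounding, as (5.146) predicts for a twice continuously differentiable function.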


Remark (Hessian of a Vector Field). If f : ℝ^n → ℝ^m is a vector field, the
Hessian is an (m × n × n)-tensor. ♢


5.8 Linearization and Multivariate Taylor Series 


The gradient ∇f of a function f is often used for a locally linear approximation
of f around x_0:

$$f(x) \approx f(x_0) + (\nabla_x f)(x_0)(x - x_0).$$ (5.148)

Here (∇_x f)(x_0) is the gradient of f with respect to x, evaluated at x_0.
Figure 5.12 illustrates the linear approximation of a function f at an input
x_0. The original function is approximated by a straight line. This approximation
is locally accurate, but the farther we move away from x_0 the
worse the approximation gets. Equation (5.148) is a special case of a multivariate
Taylor series expansion of f at x_0, where we consider only the
first two terms. We discuss the more general case in the following, which
will allow for better approximations.

[Figure 5.12: Linear approximation of a function. The original function f is linearized at x_0 = −2 using a first-order Taylor series expansion.]

[Margin note: A vector can be implemented as a one-dimensional array, a matrix as a two-dimensional array.]

[Figure 5.13: Visualizing outer products. Outer products of vectors increase the dimensionality of the array by 1 per term. (a) The outer product of two vectors results in a matrix; (b) the outer product of three vectors yields a third-order tensor.
(a) Given a vector δ ∈ ℝ^4, we obtain the outer product δ^2 := δ ⊗ δ = δδ^⊤ ∈ ℝ^{4×4} as a matrix.
(b) An outer product δ^3 := δ ⊗ δ ⊗ δ ∈ ℝ^{4×4×4} results in a third-order tensor (“three-dimensional matrix”), i.e., an array with three indexes.]
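
The statement that the linear approximation (5.148) is only locally accurate can be checked with a short experiment. The function exp used below is a stand-in chosen for this sketch and is not the function plotted in Figure 5.12; only the linearization point x_0 = −2 is taken from the figure caption.

```python
import math

def f(x):
    """Hypothetical smooth function (not the one shown in Figure 5.12)."""
    return math.exp(x)

def df(x):
    """Derivative of the hypothetical function."""
    return math.exp(x)

def linearize(f, df, x0):
    """Return the first-order Taylor approximation x -> f(x0) + f'(x0) * (x - x0)."""
    return lambda x: f(x0) + df(x0) * (x - x0)

x0 = -2.0
f_lin = linearize(f, df, x0)

# Approximation error at increasing distance from x0.
distances = (0.01, 0.1, 1.0)
errors = [abs(f(x0 + d) - f_lin(x0 + d)) for d in distances]
```

For this choice of function, the error grows monotonically with the distance from x_0, which is the behavior described above.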







































































































Definition 5.7 (Multivariate Taylor Series). We consider a function

$$f : \mathbb{R}^D \to \mathbb{R}$$ (5.149)
$$x \mapsto f(x), \quad x \in \mathbb{R}^D,$$ (5.150)

that is smooth at x_0. When we define the difference vector δ := x − x_0,
the multivariate Taylor series of f at (x_0) is defined as

$$f(x) = \sum_{k=0}^{\infty} \frac{D_x^k f(x_0)}{k!} \delta^k,$$ (5.151)

where D_x^k f(x_0) is the k-th (total) derivative of f with respect to x,
evaluated at x_0.


Definition 5.8 (Taylor Polynomial). The Taylor polynomial of degree n of
f at x_0 contains the first n + 1 components of the series in (5.151) and is
defined as

$$T_n(x) = \sum_{k=0}^{n} \frac{D_x^k f(x_0)}{k!} \delta^k.$$ (5.152)


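In (5.151) and (5.152), we used the slightly sloppy notation of δ^k,
which is not defined for vectors x ∈ ℝ^D, D > 1, and k > 1. Note that
both D_x^k f and δ^k are k-th order tensors, i.e., k-dimensional arrays. The
kth-order tensor δ^k ∈ ℝ^{D×D×...×D} (D repeated k times) is obtained as a
k-fold outer product, denoted by ⊗, of the vector δ ∈ ℝ^D. For example,

$$\delta^2 := \delta \otimes \delta = \delta \delta^\top, \quad \delta^2[i, j] = \delta[i]\delta[j]$$ (5.153)
$$\delta^3 := \delta \otimes \delta \otimes \delta, \quad \delta^3[i, j, k] = \delta[i]\delta[j]\delta[k].$$ (5.154)

Figure 5.13 visualizes two such outer products. In general, we obtain the
terms

$$D_x^k f(x_0) \delta^k = \sum_{i_1=1}^{D} \cdots \sum_{i_k=1}^{D} D_x^k f(x_0)[i_1, \dots, i_k] \, \delta[i_1] \cdots \delta[i_k]$$ (5.155)

in the Taylor series, where D_x^k f(x_0)δ^k contains k-th order polynomials.

Now that we defined the Taylor series for vector fields, let us explicitly
write down the first terms D_x^k f(x_0)δ^k of the Taylor series expansion for
k = 0, ..., 3 and δ := x − x_0:

$$k = 0: \quad D_x^0 f(x_0) \delta^0 = f(x_0) \in \mathbb{R}$$ (5.156)
$$k = 1: \quad D_x^1 f(x_0) \delta^1 = \underbrace{\nabla_x f(x_0)}_{1 \times D} \underbrace{\delta}_{D \times 1} = \sum_{i=1}^{D} \nabla_x f(x_0)[i] \, \delta[i] \in \mathbb{R}$$ (5.157)
$$k = 2: \quad D_x^2 f(x_0) \delta^2 = \operatorname{tr}\big( \underbrace{H(x_0)}_{D \times D} \underbrace{\delta}_{D \times 1} \underbrace{\delta^\top}_{1 \times D} \big) = \delta^\top H(x_0) \delta$$ (5.158)
$$= \sum_{i=1}^{D} \sum_{j=1}^{D} H[i, j] \, \delta[i] \delta[j] \in \mathbb{R}$$ (5.159)
$$k = 3: \quad D_x^3 f(x_0) \delta^3 = \sum_{i=1}^{D} \sum_{j=1}^{D} \sum_{k=1}^{D} D_x^3 f(x_0)[i, j, k] \, \delta[i] \delta[j] \delta[k] \in \mathbb{R}$$ (5.160)

Here, H(x_0) is the Hessian of f evaluated at x_0.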
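
The index sums in (5.155)-(5.160) amount to contracting a k-th order array with k copies of δ. A minimal sketch, assuming the derivative tensor is stored as a nested Python list; the helper name `contract` and the numerical values of the gradient, Hessian, and δ are placeholders chosen for illustration.

```python
from itertools import product

def contract(T, delta, k):
    """Sum T[i_1]...[i_k] * delta[i_1] * ... * delta[i_k] over all index tuples, as in (5.155)."""
    D = len(delta)
    total = 0.0
    for idx in product(range(D), repeat=k):
        entry = T
        weight = 1.0
        for i in idx:
            entry = entry[i]       # descend one level of the nested list
            weight *= delta[i]
        total += entry * weight
    return total

# Placeholder values for a function of D = 2 variables.
value = 13.0                        # zeroth-order term, a scalar
gradient = [6.0, 14.0]              # first-order tensor (length D)
hessian = [[2.0, 2.0], [2.0, 12.0]] # second-order tensor (D x D)
delta = [0.5, -1.0]

term0 = contract(value, delta, 0)       # equals value
term1 = contract(gradient, delta, 1)    # gradient times delta
term2 = contract(hessian, delta, 2)     # delta^T H delta
```

For k = 0 the product over indices is empty, so the contraction returns the scalar itself, matching (5.156).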


Example 5.15 (Taylor Series Expansion of a Function with Two Vari- 
ables) 
Consider the function 


$$f(x, y) = x^2 + 2xy + y^3.$$ (5.161)


We want to compute the Taylor series expansion of f at (x_0, y_0) = (1, 2).
Before we start, let us discuss what to expect: The function in (5.161) is 
a polynomial of degree 3. We are looking for a Taylor series expansion, 
which itself is a linear combination of polynomials. Therefore, we do not 
expect the Taylor series expansion to contain terms of fourth or higher 
order to express a third-order polynomial. This means that it should be 
sufficient to determine the first four terms of (5.151) for an exact alterna- 
tive representation of (5.161). 

To determine the Taylor series expansion, we start with the constant 
term and the first-order derivatives, which are given by 


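$$f(1, 2) = 13$$ (5.162)
$$\frac{\partial f}{\partial x} = 2x + 2y \implies \frac{\partial f}{\partial x}(1, 2) = 6$$ (5.163)
$$\frac{\partial f}{\partial y} = 2x + 3y^2 \implies \frac{\partial f}{\partial y}(1, 2) = 14.$$ (5.164)

Therefore, we obtain

$$D^1_{x,y} f(1, 2) = \nabla_{x,y} f(1, 2) = \begin{bmatrix} \frac{\partial f}{\partial x}(1, 2) & \frac{\partial f}{\partial y}(1, 2) \end{bmatrix} = \begin{bmatrix} 6 & 14 \end{bmatrix} \in \mathbb{R}^{1 \times 2}$$ (5.165)

such that

$$\frac{D^1_{x,y} f(1, 2)}{1!} \delta = \begin{bmatrix} 6 & 14 \end{bmatrix} \begin{bmatrix} x - 1 \\ y - 2 \end{bmatrix} = 6(x - 1) + 14(y - 2).$$ (5.166)

Note that D^1_{x,y} f(1, 2)δ contains only linear terms, i.e., first-order
polynomials.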

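The second-order partial derivatives are given by

$$\frac{\partial^2 f}{\partial x^2} = 2 \implies \frac{\partial^2 f}{\partial x^2}(1, 2) = 2$$ (5.167)
$$\frac{\partial^2 f}{\partial y^2} = 6y \implies \frac{\partial^2 f}{\partial y^2}(1, 2) = 12$$ (5.168)
$$\frac{\partial^2 f}{\partial y \partial x} = 2 \implies \frac{\partial^2 f}{\partial y \partial x}(1, 2) = 2$$ (5.169)
$$\frac{\partial^2 f}{\partial x \partial y} = 2 \implies \frac{\partial^2 f}{\partial x \partial y}(1, 2) = 2.$$ (5.170)

When we collect the second-order partial derivatives, we obtain the Hessian

$$H = \begin{bmatrix} \frac{\partial^2 f}{\partial x^2} & \frac{\partial^2 f}{\partial x \partial y} \\ \frac{\partial^2 f}{\partial y \partial x} & \frac{\partial^2 f}{\partial y^2} \end{bmatrix} = \begin{bmatrix} 2 & 2 \\ 2 & 6y \end{bmatrix},$$ (5.171)

such that

$$H(1, 2) = \begin{bmatrix} 2 & 2 \\ 2 & 12 \end{bmatrix} \in \mathbb{R}^{2 \times 2}.$$ (5.172)

Therefore, the next term of the Taylor-series expansion is given by

$$\frac{D^2_{x,y} f(1, 2)}{2!} \delta^2 = \frac{1}{2} \delta^\top H(1, 2) \delta$$ (5.173a)
$$= \frac{1}{2} \begin{bmatrix} x - 1 & y - 2 \end{bmatrix} \begin{bmatrix} 2 & 2 \\ 2 & 12 \end{bmatrix} \begin{bmatrix} x - 1 \\ y - 2 \end{bmatrix}$$ (5.173b)
$$= (x - 1)^2 + 2(x - 1)(y - 2) + 6(y - 2)^2.$$ (5.173c)

Here, D^2_{x,y} f(1, 2)δ^2 contains only quadratic terms, i.e., second-order
polynomials.

The third-order derivatives are obtained as

$$D^3_{x,y} f = \begin{bmatrix} \frac{\partial H}{\partial x} & \frac{\partial H}{\partial y} \end{bmatrix} \in \mathbb{R}^{2 \times 2 \times 2},$$ (5.174)
$$D^3_{x,y} f[:, :, 1] = \frac{\partial H}{\partial x} = \begin{bmatrix} \frac{\partial^3 f}{\partial x^3} & \frac{\partial^3 f}{\partial x^2 \partial y} \\ \frac{\partial^3 f}{\partial x \partial y \partial x} & \frac{\partial^3 f}{\partial x \partial y^2} \end{bmatrix},$$ (5.175)
$$D^3_{x,y} f[:, :, 2] = \frac{\partial H}{\partial y} = \begin{bmatrix} \frac{\partial^3 f}{\partial y \partial x^2} & \frac{\partial^3 f}{\partial y \partial x \partial y} \\ \frac{\partial^3 f}{\partial y^2 \partial x} & \frac{\partial^3 f}{\partial y^3} \end{bmatrix}.$$ (5.176)

Since most second-order partial derivatives in the Hessian in (5.171) are
constant, the only nonzero third-order partial derivative is

$$\frac{\partial^3 f}{\partial y^3} = 6 \implies \frac{\partial^3 f}{\partial y^3}(1, 2) = 6.$$ (5.177)

Higher-order derivatives and the mixed derivatives of degree 3 (e.g.,
∂³f/∂x²∂y) vanish, such that

$$D^3_{x,y} f[:, :, 1] = \begin{bmatrix} 0 & 0 \\ 0 & 0 \end{bmatrix}, \quad D^3_{x,y} f[:, :, 2] = \begin{bmatrix} 0 & 0 \\ 0 & 6 \end{bmatrix}$$ (5.178)

and

$$\frac{D^3_{x,y} f(1, 2)}{3!} \delta^3 = (y - 2)^3,$$ (5.179)

which collects all cubic terms of the Taylor series. Overall, the (exact)
Taylor series expansion of f at (x_0, y_0) = (1, 2) is

$$f(x) = f(1, 2) + D^1_{x,y} f(1, 2) \delta + \frac{D^2_{x,y} f(1, 2)}{2!} \delta^2 + \frac{D^3_{x,y} f(1, 2)}{3!} \delta^3$$ (5.180a)
$$= f(1, 2) + \frac{\partial f(1, 2)}{\partial x}(x - 1) + \frac{\partial f(1, 2)}{\partial y}(y - 2) + \frac{1}{2!} \left( \frac{\partial^2 f(1, 2)}{\partial x^2}(x - 1)^2 + \frac{\partial^2 f(1, 2)}{\partial y^2}(y - 2)^2 + 2 \frac{\partial^2 f(1, 2)}{\partial x \partial y}(x - 1)(y - 2) \right) + \frac{1}{6} \frac{\partial^3 f(1, 2)}{\partial y^3}(y - 2)^3$$ (5.180b)
$$= 13 + 6(x - 1) + 14(y - 2) + (x - 1)^2 + 6(y - 2)^2 + 2(x - 1)(y - 2) + (y - 2)^3.$$ (5.180c)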
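
The final expansion can be compared with the original polynomial at a few points. This is only a numerical spot check, with evaluation points chosen arbitrarily for this sketch; it does not replace the algebraic argument.

```python
def f(x, y):
    """The polynomial from (5.161)."""
    return x ** 2 + 2 * x * y + y ** 3

def taylor(x, y):
    """The Taylor expansion at (1, 2), written as in (5.180c)."""
    dx, dy = x - 1.0, y - 2.0
    return (13 + 6 * dx + 14 * dy
            + dx ** 2 + 6 * dy ** 2 + 2 * dx * dy
            + dy ** 3)

# Arbitrary evaluation points, including the expansion point itself.
points = [(-3.0, 0.5), (0.0, 0.0), (1.0, 2.0), (2.5, -4.0), (10.0, 7.0)]
max_gap = max(abs(f(x, y) - taylor(x, y)) for x, y in points)
```

Up to floating-point rounding, `max_gap` is zero at every point, including points far from (1, 2).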
In this case, we obtained an exact Taylor series expansion of the polynomial
in (5.161), i.e., the polynomial in (5.180c) is identical to the original
polynomial in (5.161). In this particular example, this result is not surprising
since the original function was a third-order polynomial, which
we expressed through a linear combination of constant terms, first-order,
second-order, and third-order polynomials in (5.180c).


5.9 Further Reading


Further details of matrix differentials, along with a short review of the 
required linear algebra, can be found in Magnus and Neudecker (2007). 
Automatic differentiation has had a long history, and we refer to Griewank 
and Walther (2003), Griewank and Walther (2008), and Elliott (2009) 
and the references therein. 

In machine learning (and other disciplines), we often need to compute 
expectations, i.e., we need to solve integrals of the form 


$$\mathbb{E}_x[f(x)] = \int f(x) p(x) \, dx.$$ (5.181)


Even if p(x) is in a convenient form (e.g., Gaussian), this integral generally
cannot be solved analytically. The Taylor series expansion of f is
one way of finding an approximate solution: Assuming p(x) = N(μ, Σ)
is Gaussian, then the first-order Taylor series expansion around μ locally
linearizes the nonlinear function f. For linear functions, we can compute
the mean (and the covariance) exactly if p(x) is Gaussian distributed (see
Section 6.5). This property is heavily exploited by the extended Kalman
filter (Maybeck, 1979) for online state estimation in nonlinear dynamical
systems (also called “state-space models”). Other deterministic ways
to approximate the integral in (5.181) are the unscented transform (Julier
and Uhlmann, 1997), which does not require any gradients, or the Laplace
approximation (MacKay, 2003; Bishop, 2006; Murphy, 2012), which uses
a second-order Taylor series expansion (requiring the Hessian) for a local
Gaussian approximation of p(x) around its mode.
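
The linearization idea can be illustrated in one dimension. The sketch below is an assumption-heavy toy: the nonlinear function sin, the values of μ and σ, the sample size, and the random seed are all chosen here for illustration. It compares the mean and variance obtained from the first-order Taylor expansion around μ with a Monte Carlo estimate; for this particular f the expectation also has a known closed form, which serves as a reference.

```python
import math
import random

def f(x):
    """Hypothetical nonlinear function."""
    return math.sin(x)

def df(x):
    """Derivative of the hypothetical function."""
    return math.cos(x)

mu, sigma = 0.5, 0.1      # hypothetical mean and standard deviation of p(x)

# First-order Taylor expansion around mu: f(x) ~ f(mu) + f'(mu) * (x - mu).
# A linear function of a Gaussian is Gaussian with these moments.
lin_mean = f(mu)
lin_var = df(mu) ** 2 * sigma ** 2

# Monte Carlo estimate of the same moments (seeded for reproducibility).
rng = random.Random(0)
samples = [f(rng.gauss(mu, sigma)) for _ in range(100_000)]
mc_mean = sum(samples) / len(samples)
mc_var = sum((s - mc_mean) ** 2 for s in samples) / (len(samples) - 1)

# Closed-form expectation of sin(x) under N(mu, sigma^2).
exact_mean = math.sin(mu) * math.exp(-sigma ** 2 / 2)
```

With a small σ the linearized moments are close to the sampled ones; the gap widens as σ grows and the curvature of f matters more.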


Exercises 
5.1 Compute the derivative f'(x) for

$$f(x) = \log(x^4) \sin(x^3).$$

5.2 Compute the derivative f'(x) of the logistic sigmoid

$$f(x) = \frac{1}{1 + \exp(-x)}.$$

5.3 Compute the derivative f'(x) of the function

$$f(x) = \exp\left(-\frac{1}{2\sigma^2}(x - \mu)^2\right),$$

where μ, σ ∈ ℝ are constants.

5.4 Compute the Taylor polynomials T_n, n = 0, ..., 5 of f(x) = sin(x) + cos(x)
at x_0 = 0.

5.5 Consider the following functions:

$$f_1(x) = \sin(x_1) \cos(x_2), \quad x \in \mathbb{R}^2$$
$$f_2(x, y) = x^\top y, \quad x, y \in \mathbb{R}^n$$
$$f_3(x) = x x^\top, \quad x \in \mathbb{R}^n$$

a. What are the dimensions of ∂f_i/∂x?
b. Compute the Jacobians.

5.6 Differentiate f with respect to t and g with respect to X, where

$$f(t) = \sin(\log(t^\top t)), \quad t \in \mathbb{R}^D$$
$$g(X) = \operatorname{tr}(AXB), \quad A \in \mathbb{R}^{D \times E}, \; X \in \mathbb{R}^{E \times F}, \; B \in \mathbb{R}^{F \times D},$$

where tr(·) denotes the trace.

5.7 Compute the derivatives df/dx of the following functions by using the chain
rule. Provide the dimensions of every single partial derivative. Describe your
steps in detail.

a.

$$f(z) = \log(1 + z), \quad z = x^\top x, \quad x \in \mathbb{R}^D$$

b.

$$f(z) = \sin(z), \quad z = Ax + b, \quad A \in \mathbb{R}^{E \times D}, \; x \in \mathbb{R}^D, \; b \in \mathbb{R}^E$$

where sin(·) is applied to every element of z.

5.8 Compute the derivatives df/dx of the following functions. Describe your
steps in detail.

a. Use the chain rule. Provide the dimensions of every single partial derivative.

$$f(z) = \exp\left(-\tfrac{1}{2} z\right)$$
$$z = g(y) = y^\top S^{-1} y$$
$$y = h(x) = x - \mu$$

where x, μ ∈ ℝ^D, S ∈ ℝ^{D×D}.

b.

$$f(x) = \operatorname{tr}(x x^\top + \sigma^2 I), \quad x \in \mathbb{R}^D$$

Here tr(A) is the trace of A, i.e., the sum of the diagonal elements A_ii.
Hint: Explicitly write out the outer product.

c. Use the chain rule. Provide the dimensions of every single partial derivative.
You do not need to compute the product of the partial derivatives
explicitly.

$$f = \tanh(z) \in \mathbb{R}^M$$
$$z = Ax + b, \quad x \in \mathbb{R}^N, \; A \in \mathbb{R}^{M \times N}, \; b \in \mathbb{R}^M.$$

Here, tanh is applied to every component of z.

5.9 We define

$$g(z, \nu) := \log p(x, z) - \log q(z, \nu)$$
$$z := t(\epsilon, \nu)$$

for differentiable functions p, q, t, and x ∈ ℝ^D, z ∈ ℝ^E, ν ∈ ℝ^F, ε ∈ ℝ^G. By
using the chain rule, compute the gradient

$$\frac{d}{d\nu} g(z, \nu).$$


Probability and Distributions


Probability, loosely speaking, concerns the study of uncertainty. Probabil- 
ity can be thought of as the fraction of times an event occurs, or as a degree 
of belief about an event. We then would like to use this probability to mea- 
sure the chance of something occurring in an experiment. As mentioned 
in Chapter 1, we often quantify uncertainty in the data, uncertainty in the 
machine learning model, and uncertainty in the predictions produced by 
the model. Quantifying uncertainty requires the idea of a random variable, 
which is a function that maps outcomes of random experiments to a set of 
properties that we are interested in. Associated with the random variable 
is a function that measures the probability that a particular outcome (or 
set of outcomes) will occur; this is called the probability distribution. 

Probability distributions are used as a building block for other con- 
cepts, such as probabilistic modeling (Section 8.4), graphical models (Sec- 
tion 8.5), and model selection (Section 8.6). In the next section, we present 
the three concepts that define a probability space (the sample space, the 
events, and the probability of an event) and how they are related to a 
fourth concept called the random variable. The presentation is deliberately
slightly hand-wavy since a rigorous presentation may occlude the
intuition behind the concepts. An outline of the concepts presented in this
chapter is shown in Figure 6.1.


6.1 Construction of a Probability Space 


The theory of probability aims at defining a mathematical structure to 
describe random outcomes of experiments. For example, when tossing a 
single coin, we cannot determine the outcome, but by doing a large num- 
ber of coin tosses, we can observe a regularity in the average outcome. 
Using this mathematical structure of probability, the goal is to perform 
automated reasoning, and in this sense, probability generalizes logical 
reasoning (Jaynes, 2003). 


6.1.1 Philosophical Issues 


When constructing automated reasoning systems, classical Boolean logic 
does not allow us to express certain forms of plausible reasoning. Consider
the following scenario, in which we know that A implies B: We observe that
A is false. We find B becomes less plausible, although no conclusion can be
drawn from classical logic. We observe that B is true. It seems A becomes
more plausible. We use this form of reasoning daily. We are waiting for a
friend, and consider three possibilities: H1, she is on time; H2, she has
been delayed by traffic; and H3, she has been abducted by aliens. When we
observe our friend is late, we must logically rule out H1. We also tend to
consider H2 to be more likely, though we are not logically required to do so.
Finally, we may consider H3 to be possible, but we continue to consider it
quite unlikely. How do we conclude H2 is the most plausible answer? Seen in
this way, probability theory can be considered a generalization of Boolean
logic. In the context of machine learning, it is often applied in this way to
formalize the design of automated reasoning systems. Further arguments about
how probability theory is the foundation of reasoning systems can be found
in Pearl (1988).

The philosophical basis of probability and how it should be somehow 
related to what we think should be true (in the logical sense) was studied 
by Cox (Jaynes, 2003). Another way to think about it is that if we are 
precise about our common sense we end up constructing probabilities. 
E. T. Jaynes (1922-1998) identified three mathematical criteria, which 
must apply to all plausibilities: 


1. The degrees of plausibility are represented by real numbers.
2. These numbers must be based on the rules of common sense.
3. The resulting reasoning must be consistent, with the three following
meanings of the word “consistent”:

(a) Consistency or non-contradiction: When the same result can be
reached through different means, the same plausibility value must
be found in all cases.

(b) Honesty: All available data must be taken into account.

(c) Reproducibility: If our state of knowledge about two problems is
the same, then we must assign the same degree of plausibility to
both of them.

[Figure 6.1: A mind map of the concepts related to random [...] chapter.]

[Margin note: [...] the discrete true and false values of truth to continuous plausibilities” (Jaynes, 2003).]

The Cox-Jaynes theorem proves these plausibilities to be sufficient to
define the universal mathematical rules that apply to plausibility p, up to
transformation by an arbitrary monotonic function. Crucially, these rules
are the rules of probability.


Remark. In machine learning and statistics, there are two major interpre- 
tations of probability: the Bayesian and frequentist interpretations (Bishop, 
2006; Efron and Hastie, 2016). The Bayesian interpretation uses probabil- 
ity to specify the degree of uncertainty that the user has about an event. It 
is sometimes referred to as “subjective probability” or “degree of belief”. 
The frequentist interpretation considers the relative frequencies of events 
of interest to the total number of events that occurred. The probability of 
an event is defined as the relative frequency of the event in the limit when 
one has infinite data. ♢


Some machine learning texts on probabilistic models use lazy notation 
and jargon, which is confusing. This text is no exception. Multiple distinct 
concepts are all referred to as “probability distribution”, and the reader 
has to often disentangle the meaning from the context. One trick to help 
make sense of probability distributions is to check whether we are trying 
to model something categorical (a discrete random variable) or some- 
thing continuous (a continuous random variable). The kinds of questions 
we tackle in machine learning are closely related to whether we are con- 
sidering categorical or continuous models. 


6.1.2 Probability and Random Variables 


There are three distinct ideas that are often confused when discussing 
probabilities. First is the idea of a probability space, which allows us to 
quantify the idea of a probability. However, we mostly do not work directly 
with this basic probability space. Instead, we work with random variables
(the second idea), which transfer the probability to a more convenient
(often numerical) space. The third idea is the idea of a distribution or law
associated with a random variable. We will introduce the first two ideas 
in this section and expand on the third idea in Section 6.2. 

Modern probability is based on a set of axioms proposed by Kolmogorov
(Grinstead and Snell, 1997; Jaynes, 2003) that introduce the three concepts
of sample space, event space, and probability measure. The probability
space models a real-world process (referred to as an experiment)
with random outcomes.


The sample space Ω
The sample space is the set of all possible outcomes of the experiment,
usually denoted by Ω. For example, two successive coin tosses have
a sample space of {hh, tt, ht, th}, where “h” denotes “heads” and “t”
denotes “tails”.

The event space 𝒜
The event space is the space of potential results of the experiment. A
subset A of the sample space Ω is in the event space 𝒜 if at the end
of the experiment we can observe whether a particular outcome ω ∈ Ω
is in A. The event space 𝒜 is obtained by considering the collection of
subsets of Ω, and for discrete probability distributions (Section 6.2.1)
𝒜 is often the power set of Ω.

The probability P
With each event A ∈ 𝒜, we associate a number P(A) that measures the
probability or degree of belief that the event will occur. P(A) is called
the probability of A.


The probability of a single event must lie in the interval [0, 1], and the
total probability over all outcomes in the sample space Ω must be 1, i.e.,
P(Ω) = 1. Given a probability space (Ω, 𝒜, P), we want to use it to model
some real-world phenomenon. In machine learning, we often avoid explicitly
referring to the probability space, but instead refer to probabilities on
quantities of interest, which we denote by 𝒯. In this book, we refer to 𝒯
as the target space and refer to elements of 𝒯 as states. We introduce a
function X : Ω → 𝒯 that takes an element of Ω (an outcome) and returns
a particular quantity of interest x, a value in 𝒯. This association/mapping
from Ω to 𝒯 is called a random variable. For example, in the case of tossing
two coins and counting the number of heads, a random variable X maps
to the three possible outcomes: X(hh) = 2, X(ht) = 1, X(th) = 1, and
X(tt) = 0. In this particular case, 𝒯 = {0, 1, 2}, and it is the probabilities
on elements of 𝒯 that we are interested in. For a finite sample space Ω and
finite 𝒯, the function corresponding to a random variable is essentially a
lookup table. For any subset S ⊆ 𝒯, we associate P_X(S) ∈ [0, 1] (the
probability) to a particular event occurring corresponding to the random
variable X. Example 6.1 provides a concrete illustration of the terminology.

Remark. The aforementioned sample space Ω unfortunately is referred
to by different names in different books. Another common name for Ω
is “state space” (Jacod and Protter, 2004), but state space is sometimes
reserved for referring to states in a dynamical system (Hasselblatt and
Katok, 2003). Other names sometimes used to describe Ω are: “sample
description space”, “possibility space,” and “event space”. ♢


Example 6.1 
We assume that the reader is already familiar with computing probabil- 
ities of intersections and unions of sets of events. A gentler introduction 
to probability with many examples can be found in chapter 2 of Walpole 
et al. (2011). 

Consider a statistical experiment where we model a funfair game con- 
sisting of drawing two coins from a bag (with replacement). There are 
coins from USA (denoted as $) and UK (denoted as £) in the bag, and 
since we draw two coins from the bag, there are four outcomes in total. 
The state space or sample space Ω of this experiment is then ($, $), ($,
£), (£, $), (£, £). Let us assume that the composition of the bag of coins is
such that a draw returns at random a $ with probability 0.3.

The event we are interested in is the total number of times the repeated
draw returns $. Let us define a random variable X that maps the sample
space Ω to 𝒯, which denotes the number of times we draw $ out of the
bag. We can see from the preceding sample space we can get zero $, one $,
or two $s, and therefore 𝒯 = {0, 1, 2}. The random variable X (a function
or lookup table) can be represented as a table like the following:


X (($, $)) = 2 (6.1) 
X(($, £)) =1 (6.2) 
X((£,$)) =1 (6.3) 
X((£,£)) =0. (6.4) 


Since we return the first coin we draw before drawing the second, this 
implies that the two draws are independent of each other, which we will 
discuss in Section 6.4.5. Note that there are two experimental outcomes, 
which map to the same event, where only one of the draws returns $. 
Therefore, the probability mass function (Section 6.2.1) of X is given by 
P(X = 2) = P(($, $))
         = P($) · P($)
         = 0.3 · 0.3 = 0.09 (6.5)
P(X = 1) = P(($, £) ∪ (£, $))
         = P(($, £)) + P((£, $))
         = 0.3 · (1 − 0.3) + (1 − 0.3) · 0.3 = 0.42 (6.6)
P(X = 0) = P((£, £))
         = P(£) · P(£)
         = (1 − 0.3) · (1 − 0.3) = 0.49. (6.7)
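The calculation in (6.5) to (6.7) can also be carried out by enumerating the sample space. The following Python sketch is illustrative only: the names coin_prob, sample_space, and pmf are our own, and the only modeling inputs are the probability 0.3 of drawing $ and the independence of the two draws.

```python
from itertools import product

# Probability of each coin on a single draw (0.3 for $ is taken from the example).
coin_prob = {"$": 0.3, "£": 0.7}

# Sample space: all ordered pairs of two draws with replacement.
sample_space = list(product(coin_prob, repeat=2))


def X(outcome):
    """Random variable X: number of times $ appears in the outcome."""
    return outcome.count("$")


# Accumulate P(X = k) over all outcomes that map to k. Because the draws are
# independent, the probability of a pair is the product of the single-draw
# probabilities.
pmf = {}
for first, second in sample_space:
    p_outcome = coin_prob[first] * coin_prob[second]
    k = X((first, second))
    pmf[k] = pmf.get(k, 0.0) + p_outcome

for k in sorted(pmf):
    print(f"P(X = {k}) = {pmf[k]:.2f}")
```

The printed values should agree with (6.5) to (6.7) up to rounding.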
In the calculation, we equated two different concepts, the probability
of the output of X and the probability of the samples in Ω. For example,
in (6.7) we say P(X = 0) = P((£, £)). Consider the random variable
X : Ω → 𝒯 and a subset S ⊆ 𝒯 (for example, a single element of 𝒯,
such as the outcome that one head is obtained when tossing two coins).
Let X⁻¹(S) be the pre-image of S by X, i.e., the set of elements of Ω that
map to S under X: {ω ∈ Ω : X(ω) ∈ S}. One way to understand the
transformation of probability from events in Ω via the random variable
X is to associate it with the probability of the pre-image of S (Jacod and
Protter, 2004). For S ⊆ 𝒯, we have the notation


P_X(S) = P(X ∈ S) = P(X⁻¹(S)) = P({ω ∈ Ω : X(ω) ∈ S}). (6.8)


The left-hand side of (6.8) is the probability of the set of possible outcomes 
(e.g., number of $ = 1) that we are interested in. Via the random variable 
X, which maps states to outcomes, we see in the right-hand side of (6.8) 
that this is the probability of the set of states (in Ω) that have the property
(e.g., $£, £$). We say that a random variable X is distributed according
to a particular probability distribution P_X, which defines the probability
mapping between the event and the probability of the outcome of the
random variable. In other words, the function P_X or equivalently P ∘ X⁻¹
is the law or distribution of random variable X.
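To make the pre-image construction in (6.8) concrete, the following sketch computes P_X(S) for the two-coin example by first collecting the states that X maps into S and then summing their probabilities. The function names preimage and P_X are our own choices for illustration.

```python
from itertools import product

coin_prob = {"$": 0.3, "£": 0.7}

# Probability P on the sample space Ω (independent draws with replacement).
P = {w: coin_prob[w[0]] * coin_prob[w[1]] for w in product(coin_prob, repeat=2)}


def X(w):
    """Random variable X: the number of $ in the state w."""
    return w.count("$")


def preimage(S):
    """Return the pre-image of S under X, i.e., all states w with X(w) in S."""
    return {w for w in P if X(w) in S}


def P_X(S):
    """Law of X: the probability of the pre-image of S under X."""
    return sum(P[w] for w in preimage(S))


print(preimage({1}))       # the two states with exactly one $
print(round(P_X({1}), 2))  # probability that X = 1
```

Choosing S = {0, 1, 2}, the whole target space, returns the full sample space as pre-image, so P_X(S) is one up to floating-point error.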


Remark. The target space, that is, the range 𝒯 of the random variable X,
is used to indicate the kind of probability space, i.e., a 𝒯 random variable.
When 𝒯 is finite or countably infinite, this is called a discrete random
variable (Section 6.2.1). For continuous random variables (Section 6.2.2),
we only consider 𝒯 = ℝ or 𝒯 = ℝ^D. ♢


6.1.3 Statistics 


Probability theory and statistics are often presented together, but they con- 
cern different aspects of uncertainty. One way of contrasting them is by the 
kinds of problems that are considered. Using probability, we can consider 
a model of some process, where the underlying uncertainty is captured 
by random variables, and we use the rules of probability to derive what 
happens. In statistics, we observe that something has happened and try 
to figure out the underlying process that explains the observations. In this 
sense, machine learning is close to statistics in its goals to construct a 
model that adequately represents the process that generated the data. We 
can use the rules of probability to obtain a “best-fitting” model for some 
data. 

Another aspect of machine learning systems is that we are interested 
in generalization error (see Chapter 8). This means that we are actually 
interested in the performance of our system on instances that we will 
observe in future, which are not identical to the instances that we have 
seen so far. This analysis of future performance relies on probability and
statistics, most of which is beyond what will be presented in this chapter. 
The interested reader is encouraged to look at the books by Boucheron 
et al. (2013) and Shalev-Shwartz and Ben-David (2014). We will see more 
about statistics in Chapter 8. 


6.2 Discrete and Continuous Probabilities 


Let us focus our attention on ways to describe the probability of an event 
as introduced in Section 6.1. Depending on whether the target space is dis- 
crete or continuous, the natural way to refer to distributions is different. 
When the target space 𝒯 is discrete, we can specify the probability that a
random variable X takes a particular value x ∈ 𝒯, denoted as P(X = x).
The expression P(X = x) for a discrete random variable X is known as
the probability mass function. When the target space 𝒯 is continuous, e.g.,
the real line ℝ, it is more natural to specify the probability that a random
variable X is in an interval, denoted by P(a ≤ X ≤ b) for a < b. By
convention, we specify the probability that a random variable X is less than
or equal to a particular value x, denoted by P(X ≤ x). The expression
P(X ≤ x) for a continuous random variable X is known as the cumulative
distribution function. We will discuss continuous random variables in Section 6.2.2.
We will revisit the nomenclature and contrast discrete and continuous 
random variables in Section 6.2.3. 


Remark. We will use the phrase univariate distribution to refer to distribu- 
tions of a single random variable (whose states are denoted by non-bold 
x). We will refer to distributions of more than one random variable as 
multivariate distributions, and will usually consider a vector of random 
variables (whose states are denoted by bold 𝒙). ♢


6.2.1 Discrete Probabilities 


When the target space is discrete, we can imagine the probability distri- 
bution of multiple random variables as filling out a (multidimensional) 
array of numbers. Figure 6.2 shows an example. The target space of the 
joint probability is the Cartesian product of the target spaces of each of 
the random variables. We define the joint probability as the entry of both
values jointly

P(X = x_i, Y = y_j) = \frac{n_{ij}}{N}, (6.9)

where n_ij is the number of events with state x_i and y_j and N the total
number of events. The joint probability is the probability of the
intersection of both events, that is, P(X = x_i, Y = y_j) = P(X = x_i ∩ Y = y_j).
Figure 6.2 illustrates the probability mass function (pmf) of a discrete prob- 
ability distribution. For two random variables X and Y, the probability 
that X = x and Y = y is (lazily) written as p(x, y) and is called the joint
probability. One can think of a probability as a function that takes state
x and y and returns a real number, which is the reason we write p(x, y).
The marginal probability that X takes the value x irrespective of the value
of random variable Y is (lazily) written as p(x). We write X ∼ p(x) to
denote that the random variable X is distributed according to p(x). If we
consider only the instances where X = x, then the fraction of instances
(the conditional probability) for which Y = y is written (lazily) as p(y | x).


Example 6.2 

Consider two random variables X and Y, where X has five possible states 
and Y has three possible states, as shown in Figure 6.2. We denote by n_ij
the number of events with state X = x_i and Y = y_j, and denote by
N the total number of events. The value c_i is the sum of the individual
frequencies for the ith column, that is, c_i = \sum_{j=1}^{3} n_{ij}. Similarly,
the value r_j is the row sum, that is, r_j = \sum_{i=1}^{5} n_{ij}. Using these
definitions, we can compactly express the distribution of X and Y.

The probability distribution of each random variable, the marginal 
probability, can be seen as the sum over a row or column 


P(X = x_i) = \frac{c_i}{N} = \frac{\sum_{j=1}^{3} n_{ij}}{N} (6.10)

and

P(Y = y_j) = \frac{r_j}{N} = \frac{\sum_{i=1}^{5} n_{ij}}{N}, (6.11)


where c_i and r_j are the ith column and jth row of the probability table,
respectively. By convention, for discrete random variables with a finite
number of events, we assume that probabilities sum up to one, that is,

\sum_{i=1}^{5} P(X = x_i) = 1 \quad \text{and} \quad \sum_{j=1}^{3} P(Y = y_j) = 1. (6.12)

The conditional probability is the fraction of a row or column in a
particular cell. For example, the conditional probability of Y given X is

P(Y = y_j \mid X = x_i) = \frac{n_{ij}}{c_i}, (6.13)

and the conditional probability of X given Y is

P(X = x_i \mid Y = y_j) = \frac{n_{ij}}{r_j}. (6.14)

In machine learning, we use discrete probability distributions to model 
categorical variables, i.e., variables that take a finite set of unordered val- 
ues. They could be categorical features, such as the degree taken at uni- 
versity when used for predicting the salary of a person, or categorical la- 
bels, such as letters of the alphabet when doing handwriting recognition. 
Discrete distributions are also often used to construct probabilistic models 
that combine a finite number of continuous distributions (Chapter 11). 
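The quantities in Example 6.2 can be computed directly from a table of counts. The sketch below uses a hypothetical 3 × 5 table (the counts are invented for illustration and are not those of Figure 6.2); rows index the states of Y and columns index the states of X, so counts[j][i] plays the role of n_ij.

```python
# Hypothetical counts n_ij: 3 states of Y (rows) and 5 states of X (columns).
counts = [
    [2, 4, 6, 3, 5],  # y_1
    [1, 3, 8, 6, 2],  # y_2
    [4, 2, 1, 2, 1],  # y_3
]
J, I = len(counts), len(counts[0])
N = sum(sum(row) for row in counts)  # total number of events

c = [sum(counts[j][i] for j in range(J)) for i in range(I)]  # column sums c_i
r = [sum(row) for row in counts]                              # row sums r_j

# Joint probability (6.9), marginals (6.10) and (6.11).
p_joint = [[counts[j][i] / N for i in range(I)] for j in range(J)]
p_x = [c_i / N for c_i in c]
p_y = [r_j / N for r_j in r]

# Conditionals (6.13) and (6.14): a cell as a fraction of its column or row.
p_y_given_x = [[counts[j][i] / c[i] for i in range(I)] for j in range(J)]
p_x_given_y = [[counts[j][i] / r[j] for i in range(I)] for j in range(J)]

print("P(X = x_i):", p_x)
print("P(Y = y_j):", p_y)
print("P(Y = y_j | X = x_3):", [p_y_given_x[j][2] for j in range(J)])
```

Each marginal sums to one, as required by (6.12), and each conditional distribution sums to one over the variable that is not conditioned on.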


6.2.2 Continuous Probabilities 


We consider real-valued random variables in this section, i.e., we consider 
target spaces that are intervals of the real line R. In this book, we pretend 
that we can perform operations on real random variables as if we have dis- 
crete probability spaces with finite states. However, this simplification is 
not precise for two situations: when we repeat something infinitely often, 
and when we want to draw a point from an interval. The first situation 
arises when we discuss generalization errors in machine learning (Chap- 
ter 8). The second situation arises when we want to discuss continuous 
distributions, such as the Gaussian (Section 6.5). For our purposes, the 
lack of precision allows for a briefer introduction to probability. 


Remark. In continuous spaces, there are two additional technicalities, 
which are counterintuitive. First, the set of all subsets (used to define 
the event space A in Section 6.1) is not well behaved enough. A needs 
to be restricted to behave well under set complements, set intersections, 
and set unions. Second, the size of a set (which in discrete spaces can be 
obtained by counting the elements) turns out to be tricky. The size of a 
set is called its measure. For example, the cardinality of discrete sets, the 
length of an interval in ℝ, and the volume of a region in ℝ^d are all
measures. Sets that behave well under set operations and additionally have
a topology are called a Borel σ-algebra. Betancourt details a careful
construction of probability spaces from set theory without being bogged down
in technicalities; see https://tinyurl.com/yb3t6mfd. For a more
precise construction, we refer to Billingsley (1995) and Jacod and Protter
(2004). 

In this book, we consider real-valued random variables with their
corresponding Borel σ-algebra. We consider random variables with values in
ℝ^D to be a vector of real-valued random variables. ♢


Definition 6.1 (Probability Density Function). A function f : ℝ^D → ℝ is
called a probability density function (pdf) if


1. ∀𝒙 ∈ ℝ^D : f(𝒙) ≥ 0
2. Its integral exists and


\int_{\mathbb{R}^D} f(\boldsymbol{x})\, d\boldsymbol{x} = 1. (6.15)


For probability mass functions (pmf) of discrete random variables, the 
integral in (6.15) is replaced with a sum (6.12). 


Observe that the probability density function is any function f that is 
non-negative and integrates to one. We associate a random variable X 
with this function f by 


P(a \leq X \leq b) = \int_a^b f(x)\, dx, (6.16)


where a, b ∈ ℝ and x ∈ ℝ are outcomes of the continuous random
variable X. States 𝒙 ∈ ℝ^D are defined analogously by considering a vector
of x ∈ ℝ. This association (6.16) is called the law or distribution of the
random variable X. 


Remark. In contrast to discrete random variables, the probability of a
continuous random variable X taking a particular value P(X = x) is zero.
This is like trying to specify an interval in (6.16) where a = b. ♢


Definition 6.2 (Cumulative Distribution Function). A cumulative distribu- 
tion function (cdf) of a multivariate real-valued random variable X with 
states 𝒙 ∈ ℝ^D is given by


F_X(\boldsymbol{x}) = P(X_1 \leq x_1, \ldots, X_D \leq x_D), (6.17)


where X = [X_1, ..., X_D]^⊤, 𝒙 = [x_1, ..., x_D]^⊤, and the right-hand side
represents the probability that random variable X_i takes the value smaller
than or equal to x_i.


The cdf can be expressed also as the integral of the probability density
function f(𝒙) so that


F_X(\boldsymbol{x}) = \int_{-\infty}^{x_1} \cdots \int_{-\infty}^{x_D} f(z_1, \ldots, z_D)\, dz_1 \cdots dz_D. (6.18)


Remark. We reiterate that there are in fact two distinct concepts when
talking about distributions. First is the idea of a pdf (denoted by f(x)),
which is a nonnegative function that integrates to one. Second is the law of a
random variable X, that is, the association of a random variable X with
the pdf f(x). ♢
[Figure 6.3: (a) Discrete distribution; (b) Continuous distribution.]


For most of this book, we will not use the notation f(x) and F_X(x) as
we mostly do not need to distinguish between the pdf and cdf. However, 
we will need to be careful about pdfs and cdfs in Section 6.7. 
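Definitions 6.1 and 6.2 can be checked numerically for a simple density. The following sketch assumes a uniform density on the interval [0.9, 1.6], chosen purely for illustration, and uses a hand-written midpoint rule (the helper integrate is our own, not a library routine) to approximate the integrals in (6.15) and (6.16).

```python
LO, HI = 0.9, 1.6  # illustrative interval; the density is zero outside it


def pdf(x):
    """Uniform density on [LO, HI]."""
    return 1.0 / (HI - LO) if LO <= x <= HI else 0.0


def cdf(x):
    """Closed-form cdf of the uniform density above."""
    if x < LO:
        return 0.0
    if x > HI:
        return 1.0
    return (x - LO) / (HI - LO)


def integrate(f, a, b, n=10_000):
    """Midpoint-rule approximation of the integral of f over [a, b]."""
    h = (b - a) / n
    return h * sum(f(a + (i + 0.5) * h) for i in range(n))


total = integrate(pdf, LO, HI)       # should be close to 1, as in (6.15)
prob = integrate(pdf, 1.0, 1.2)      # P(1.0 <= X <= 1.2), as in (6.16)
prob_from_cdf = cdf(1.2) - cdf(1.0)  # the same probability via the cdf

print(pdf(1.0), total, prob, prob_from_cdf)
```

Note that the density value pdf(1.0) exceeds 1 even though every probability computed from it lies in [0, 1].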


6.2.3 Contrasting Discrete and Continuous Distributions 


Recall from Section 6.1.2 that probabilities are positive and the total prob- 
ability sums up to one. For discrete random variables (see (6.12)), this 
implies that the probability of each state must lie in the interval [0, 1]. 
However, for continuous random variables the normalization (see (6.15)) 
does not imply that the value of the density is less than or equal to 1 for 
all values. We illustrate this in Figure 6.3 using the uniform distribution 
for both discrete and continuous random variables. 


Example 6.3 
We consider two examples of the uniform distribution, where each state is 
equally likely to occur. This example illustrates some differences between 
discrete and continuous probability distributions. 

Let Z be a discrete uniform random variable with three states {z =
−1.1, z = 0.3, z = 1.5}. The probability mass function can be represented
as a table of probability values:


z           −1.1    0.3    1.5
P(Z = z)    1/3     1/3    1/3


Alternatively, we can think of this as a graph (Figure 6.3(a)), where we 
use the fact that the states can be located on the x-axis, and the y-axis 
represents the probability of a particular state. The y-axis in Figure 6.3(a) 
is deliberately extended so that it is the same as in Figure 6.3(b).

Let X be a continuous random variable taking values in the range 0.9 ≤
X ≤ 1.6, as represented by Figure 6.3(b). Observe that the height of the
density can be greater than 1. However, it needs to hold that


\int_{0.9}^{1.6} p(x)\, dx = 1. (6.19)


Table 6.1

Type          “Point probability”             “Interval probability”
Discrete      P(X = x)                        Not applicable
              Probability mass function
Continuous    p(x)                            P(X ≤ x)
              Probability density function    Cumulative distribution function


Remark. There is an additional subtlety with regards to discrete
probability distributions. The states z_1, ..., z_d do not in principle have any
structure, i.e., there is usually no way to compare them, for example
z_1 = red, z_2 = green, z_3 = blue. However, in many machine learning
applications discrete states take numerical values, e.g., z_1 = −1.1, z_2 =
0.3, z_3 = 1.5, where we could say z_1 < z_2 < z_3. Discrete states that
assume numerical values are particularly useful because we often consider
expected values (Section 6.4.1) of random variables. ♢


Unfortunately, machine learning literature uses notation and
nomenclature that hides the distinction between the sample space Ω, the target
space 𝒯, and the random variable X. For a value x of the set of possible
outcomes of the random variable X, i.e., x ∈ 𝒯, p(x) denotes the
probability that random variable X has the outcome x. For discrete random
variables, this is written as P(X = x), which is known as the probability
mass function. The pmf is often referred to as the “distribution”. For
continuous variables, p(x) is called the probability density function (often
referred to as a density). To muddy things even further, the cumulative
distribution function P(X ≤ x) is often also referred to as the
“distribution”. In this chapter, we will use the notation X to refer to both
univariate and multivariate random variables, and denote the states by x and 𝒙,
respectively. We summarize the nomenclature in Table 6.1.


Remark. We will be using the expression “probability distribution” not 
only for discrete probability mass functions but also for continuous proba- 
bility density functions, although this is technically incorrect. In line with 
most machine learning literature, we also rely on context to distinguish 
the different uses of the phrase probability distribution. ♢


6.3 Sum Rule, Product Rule, and Bayes’ Theorem 


We think of probability theory as an extension to logical reasoning. As we 
discussed in Section 6.1.1, the rules of probability presented here follow 
naturally from fulfilling the desiderata (Jaynes, 2003, chapter 2).
Probabilistic modeling (Section 8.4) provides a principled foundation for
designing machine learning methods. Once we have defined probability
distributions (Section 6.2) corresponding to the uncertainties of the data and
our problem, it turns out that there are only two fundamental rules, the 
sum rule and the product rule. 

Recall from (6.9) that p(x, y) is the joint distribution of the two
random variables x, y. The distributions p(x) and p(y) are the
corresponding marginal distributions, and p(y | x) is the conditional
distribution of y given x. Given the definitions of the marginal and conditional probability
for discrete and continuous random variables in Section 6.2, we can now 
present the two fundamental rules in probability theory. 

The first rule, the sum rule, states that 


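p(x) = \begin{cases}
  \sum_{y \in \mathcal{Y}} p(x, y) & \text{if } y \text{ is discrete} \\
  \int_{\mathcal{Y}} p(x, y)\, dy & \text{if } y \text{ is continuous}
\end{cases}, (6.20)

where 𝒴 are the states of the target space of random variable Y. This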
means that we sum out (or integrate out) the set of states y of the random 
variable Y. The sum rule is also known as the marginalization property. 
The sum rule relates the joint distribution to a marginal distribution. In 
general, when the joint distribution contains more than two random vari- 
ables, the sum rule can be applied to any subset of the random variables, 
resulting in a marginal distribution of potentially more than one random 
variable. More concretely, if x = [x_1, ..., x_D]^⊤, we obtain the marginal


p(x_i) = \int p(x_1, \ldots, x_D)\, dx_{\setminus i} (6.21)


by repeated application of the sum rule where we integrate/sum out all
random variables except x_i, which is indicated by \i, which reads “all
except i.”


Remark. Many of the computational challenges of probabilistic modeling 
are due to the application of the sum rule. When there are many variables 
or discrete variables with many states, the sum rule boils down to per- 
forming a high-dimensional sum or integral. Performing high-dimensional 
sums or integrals is generally computationally hard, in the sense that there 
is no known polynomial-time algorithm to calculate them exactly. ♢


The second rule, known as the product rule, relates the joint distribution 
to the conditional distribution via 


p(x, y) = p(y | x) p(x). (6.22)


The product rule can be interpreted as the fact that every joint distribu- 
tion of two random variables can be factorized (written as a product) 
of two other distributions. The two factors are the marginal
distribution of the first random variable p(x), and the conditional distribution
of the second random variable given the first p(y | x). Since the ordering
of random variables is arbitrary in p(x, y), the product rule also implies
p(x, y) = p(x | y) p(y). To be precise, (6.22) is expressed in terms of the
probability mass functions for discrete random variables. For continuous 
random variables, the product rule is expressed in terms of the probability 
density functions (Section 6.2.3). 

In machine learning and Bayesian statistics, we are often interested in 
making inferences of unobserved (latent) random variables given that we 
have observed other random variables. Let us assume we have some prior 
knowledge p(x) about an unobserved random variable x and some rela- 
tionship p(y |x) between x and a second random variable y, which we 
can observe. If we observe y, we can use Bayes’ theorem to draw some 
conclusions about x given the observed values of y. Bayes’ theorem (also 
Bayes’ rule or Bayes’ law) 


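\underbrace{p(x \mid y)}_{\text{posterior}} = \frac{\overbrace{p(y \mid x)}^{\text{likelihood}}\; \overbrace{p(x)}^{\text{prior}}}{\underbrace{p(y)}_{\text{evidence}}} (6.23)

is a direct consequence of the product rule in (6.22) since

p(x, y) = p(x \mid y)\, p(y) (6.24)

and

p(x, y) = p(y \mid x)\, p(x) (6.25)

so that

p(x \mid y)\, p(y) = p(y \mid x)\, p(x) \iff p(x \mid y) = \frac{p(y \mid x)\, p(x)}{p(y)}. (6.26)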


In (6.23), p(x) is the prior, which encapsulates our subjective prior
knowledge of the unobserved (latent) variable x before observing any 
data. We can choose any prior that makes sense to us, but it is critical to 
ensure that the prior has a nonzero pdf (or pmf) on all plausible x, even 
if they are very rare. 

The likelihood p(y | x) describes how x and y are related, and in the
case of discrete probability distributions, it is the probability of the data y
if we were to know the latent variable x. Note that the likelihood is not a
distribution in x, but only in y. We call p(y | x) either the “likelihood of
x (given y)” or the “probability of y given x” but never the likelihood of 
y (MacKay, 2003). 

The posterior p(x | y) is the quantity of interest in Bayesian statistics
because it expresses exactly what we are interested in, i.e., what we know 
about x after having observed y. 
The quantity


p(y) := \int p(y \mid x)\, p(x)\, dx = \mathbb{E}_X[p(y \mid x)] (6.27)


is the marginal likelihood/evidence. The right-hand side of (6.27) uses the 
expectation operator which we define in Section 6.4.1. By definition, the 
marginal likelihood integrates the numerator of (6.23) with respect to the 
latent variable x. Therefore, the marginal likelihood is independent of
x, and it ensures that the posterior p(x | y) is normalized. The marginal
likelihood can also be interpreted as the expected likelihood where we 
take the expectation with respect to the prior p(x). Beyond normalization 
of the posterior, the marginal likelihood also plays an important role in 
Bayesian model selection, as we will discuss in Section 8.6. Due to the 
integration in (8.44), the evidence is often hard to compute. 

Bayes’ theorem (6.23) allows us to invert the relationship between x
and y given by the likelihood. Therefore, Bayes’ theorem is sometimes 
called the probabilistic inverse. We will discuss Bayes’ theorem further in 
Section 8.4. 
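For discrete variables, Bayes' theorem (6.23) reduces to a few lines of arithmetic. The sketch below uses an invented toy model: a latent variable x with states "rain" and "dry", and an observation y with states "wet" and "not_wet". All numbers and names are hypothetical and serve only to illustrate how the product rule and the sum rule combine.

```python
# Prior p(x) over the latent variable (hypothetical numbers).
prior = {"rain": 0.2, "dry": 0.8}

# Likelihood p(y | x): one distribution over y for every value of x.
likelihood = {
    "rain": {"wet": 0.9, "not_wet": 0.1},
    "dry": {"wet": 0.15, "not_wet": 0.85},
}


def posterior(y):
    """Return the posterior p(x | y) and the evidence p(y) for an observed y."""
    # Product rule (6.22): p(x, y) = p(y | x) p(x).
    joint = {x: likelihood[x][y] * prior[x] for x in prior}
    # Sum rule (6.20): p(y) is obtained by summing the joint over x.
    evidence = sum(joint.values())
    # Bayes' theorem (6.23): normalize the joint by the evidence.
    return {x: joint[x] / evidence for x in joint}, evidence


post, evidence = posterior("wet")
print("p(y = wet) =", round(evidence, 4))
print("p(x | y = wet) =", {x: round(p, 4) for x, p in post.items()})
```

With these numbers, observing "wet" raises the probability of "rain" from the prior value 0.2 to about 0.6. The likelihood sums to one over y for each fixed x, but not over x for a fixed y, which illustrates that it is a distribution in y only.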


Remark. In Bayesian statistics, the posterior distribution is the quantity 
of interest as it encapsulates all available information from the prior and 
the data. Instead of carrying the posterior around, it is possible to focus 
on some statistic of the posterior, such as the maximum of the posterior, 
which we will discuss in Section 8.3. However, focusing on some statistic 
of the posterior leads to loss of information. If we think in a bigger con- 
text, then the posterior can be used within a decision-making system, and 
having the full posterior can be extremely useful and lead to decisions that 
are robust to disturbances. For example, in the context of model-based re- 
inforcement learning, Deisenroth et al. (2015) show that using the full 
posterior distribution of plausible transition functions leads to very fast 
(data/sample efficient) learning, whereas focusing on the maximum of 
the posterior leads to consistent failures. Therefore, having the full pos- 
terior can be very useful for a downstream task. In Chapter 9, we will 
continue this discussion in the context of linear regression. ♢


6.4 Summary Statistics and Independence 


We are often interested in summarizing sets of random variables and com- 
paring pairs of random variables. A statistic of a random variable is a de- 
terministic function of that random variable. The summary statistics of a 
distribution provide one useful view of how a random variable behaves, 
and as the name suggests, provide numbers that summarize and charac- 
terize the distribution. We describe the mean and the variance, two well- 
known summary statistics. Then we discuss two ways to compare a pair 
of random variables: first, how to say that two random variables are inde- 
pendent; and second, how to compute an inner product between them. 
6.4.1 Means and Covariances


Mean and (co)variance are often useful to describe properties of probabil- 
ity distributions (expected values and spread). We will see in Section 6.6 
that there is a useful family of distributions (called the exponential fam- 
ily), where the statistics of the random variable capture all possible infor- 
mation. 

The concept of the expected value is central to machine learning, and 
the foundational concepts of probability itself can be derived from the 
expected value (Whittle, 2000). 


Definition 6.3 (Expected Value). The expected value of a function g : ℝ →
ℝ of a univariate continuous random variable X ∼ p(x) is given by


\mathbb{E}_X[g(x)] = \int_{\mathcal{X}} g(x)\, p(x)\, dx. (6.28)

Correspondingly, the expected value of a function g of a discrete random
variable X ∼ p(x) is given by


\mathbb{E}_X[g(x)] = \sum_{x \in \mathcal{X}} g(x)\, p(x), (6.29)


where 𝒳 is the set of possible outcomes (the target space) of the random
variable X.


In this section, we consider discrete random variables to have numerical 
outcomes. This can be seen by observing that the function g takes real 
numbers as inputs. 


Remark. We consider multivariate random variables X as a finite vector
of univariate random variables [X_1, ..., X_D]^⊤. For multivariate random
variables, we define the expected value element wise


\mathbb{E}_X[g(\boldsymbol{x})] = \begin{bmatrix} \mathbb{E}_{X_1}[g(x_1)] \\ \vdots \\ \mathbb{E}_{X_D}[g(x_D)] \end{bmatrix} \in \mathbb{R}^D, (6.30)


where the subscript E_{X_d} indicates that we are taking the expected value
with respect to the dth element of the vector 𝒙. ♢


Definition 6.3 defines the meaning of the notation E_X as the operator
indicating that we should take the integral with respect to the probabil- 
ity density (for continuous distributions) or the sum over all states (for 
discrete distributions). The definition of the mean (Definition 6.4), is a 
special case of the expected value, obtained by choosing g to be the iden- 
tity function. 
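For a discrete random variable, (6.29) is a weighted sum over the states. As a small sketch, we reuse the three-state uniform random variable from Example 6.3; the helper name expectation and the dictionary representation of the pmf are our own choices.

```python
def expectation(g, pmf):
    """E_X[g(x)] for a discrete pmf given as a dict {state: probability}."""
    return sum(g(x) * p for x, p in pmf.items())


# Discrete uniform distribution over the three states of Example 6.3.
pmf = {-1.1: 1 / 3, 0.3: 1 / 3, 1.5: 1 / 3}

mean = expectation(lambda x: x, pmf)                # g is the identity function
second_moment = expectation(lambda x: x ** 2, pmf)  # g(x) = x^2

print("E[x]   =", mean)
print("E[x^2] =", second_moment)
```

Choosing g to be the identity recovers the mean, while other choices of g give other summary quantities of the same distribution.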


Definition 6.4 (Mean). The mean of a random variable X with states
𝒙 ∈ ℝ^D is an average and is defined as


\mathbb{E}_X[\boldsymbol{x}] = \begin{bmatrix} \mathbb{E}_{X_1}[x_1] \\ \vdots \\ \mathbb{E}_{X_D}[x_D] \end{bmatrix} \in \mathbb{R}^D, (6.31)

where

\mathbb{E}_{X_d}[x_d] := \begin{cases}
  \int_{\mathcal{X}} x_d\, p(x_d)\, dx_d & \text{if } X \text{ is a continuous random variable} \\
  \sum_{x_i \in \mathcal{X}} x_i\, p(x_d = x_i) & \text{if } X \text{ is a discrete random variable}
\end{cases} (6.32)

for d = 1, ..., D, where the subscript d indicates the corresponding
dimension of 𝒙. The integral and sum are over the states 𝒳 of the target
space of the random variable X.


In one dimension, there are two other intuitive notions of “average”, 
which are the median and the mode. The median is the “middle” value if 
we sort the values, i.e., 50% of the values are greater than the median and 
50% are smaller than the median. This idea can be generalized to contin- 
uous values by considering the value where the cdf (Definition 6.2) is 0.5. 
For distributions that are asymmetric or have long tails, the median
provides an estimate of a typical value that is closer to human intuition 
than the mean value. Furthermore, the median is more robust to outliers 
than the mean. The generalization of the median to higher dimensions is 
non-trivial as there is no obvious way to “sort” in more than one dimen- 
sion (Hallin et al., 2010; Kong and Mizera, 2012). The mode is the most 
frequently occurring value. For a discrete random variable, the mode is 
defined as the value of x having the highest frequency of occurrence. For 
a continuous random variable, the mode is defined as a peak in the density 
p(x). A particular density p(x) may have more than one mode, and fur- 
thermore there may be a very large number of modes in high-dimensional 
distributions. Therefore, finding all the modes of a distribution can be 
computationally challenging. 
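The difference between the three notions of "average" is easy to see on a small sample with one outlier. The sketch below uses Python's standard statistics module on invented data, treating the empirical frequencies of the sample as a discrete distribution.

```python
import statistics

# Hypothetical sample with a long right tail (one large outlier).
sample = [1, 2, 2, 3, 3, 3, 4, 50]

mean = statistics.mean(sample)        # pulled upward by the outlier
median = statistics.median(sample)    # middle of the sorted values
modes = statistics.multimode(sample)  # most frequent value(s), as a list

print("mean   =", mean)    # 8.5
print("median =", median)  # 3.0
print("modes  =", modes)   # [3]
```

The outlier moves the mean far away from the bulk of the data, whereas the median and the mode stay at 3. The function multimode returns a list because a distribution may have more than one mode.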


Example 6.4 
Consider the two-dimensional distribution illustrated in Figure 6.4: 


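p(\boldsymbol{x}) = 0.4\, \mathcal{N}\!\left(\boldsymbol{x} \,\middle|\, \begin{bmatrix} 10 \\ 2 \end{bmatrix}, \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix}\right) + 0.6\, \mathcal{N}\!\left(\boldsymbol{x} \,\middle|\, \begin{bmatrix} 0 \\ 0 \end{bmatrix}, \begin{bmatrix} 8.4 & 2.0 \\ 2.0 & 1.7 \end{bmatrix}\right). (6.33)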


We will define the Gaussian distribution 𝒩(μ, σ²) in Section 6.5. Also
shown is its corresponding marginal distribution in each dimension.
Observe that the distribution is bimodal (has two modes), but one of the
marginal distributions is unimodal (has one mode). The horizontal
bimodal univariate distribution illustrates that the mean and median can
be different from each other. While it is tempting to define the
two-dimensional median to be the concatenation of the medians in each
dimension, the fact that we cannot define an ordering of two-dimensional
points makes it difficult. When we say “cannot define an ordering”, we
mean that there is more than one way to define the relation < so that


\begin{bmatrix} 3 \\ 0 \end{bmatrix} < \begin{bmatrix} 2 \\ 3 \end{bmatrix}.


[Figure 6.4: Illustration of the mean, mode, and median for a two-dimensional dataset, as well as its marginal densities. Legend: mean, modes, median.]





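Remark. The expected value (Definition 6.3) is a linear operator. For
example, given a real-valued function f(𝒙) = a g(𝒙) + b h(𝒙) where a, b ∈ ℝ
and 𝒙 ∈ ℝ^D, we obtain


\mathbb{E}_X[f(\boldsymbol{x})] = \int f(\boldsymbol{x})\, p(\boldsymbol{x})\, d\boldsymbol{x} (6.34a)
  = \int [a\, g(\boldsymbol{x}) + b\, h(\boldsymbol{x})]\, p(\boldsymbol{x})\, d\boldsymbol{x} (6.34b)
  = a \int g(\boldsymbol{x})\, p(\boldsymbol{x})\, d\boldsymbol{x} + b \int h(\boldsymbol{x})\, p(\boldsymbol{x})\, d\boldsymbol{x} (6.34c)
  = a\, \mathbb{E}_X[g(\boldsymbol{x})] + b\, \mathbb{E}_X[h(\boldsymbol{x})]. (6.34d)
♢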
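The linearity in (6.34) can be verified numerically in one dimension. The sketch below assumes a uniform density on [0, 1] and two arbitrary functions g and h; the constants a and b and the midpoint-rule helper integrate are our own illustrative choices.

```python
import math


def integrate(f, lo, hi, n=10_000):
    """Midpoint-rule approximation of the integral of f over [lo, hi]."""
    step = (hi - lo) / n
    return step * sum(f(lo + (i + 0.5) * step) for i in range(n))


def p(x):
    """Uniform density on [0, 1] (an assumption made for this sketch)."""
    return 1.0


def expectation(func):
    """E_X[func(x)] as in (6.28), approximated numerically."""
    return integrate(lambda x: func(x) * p(x), 0.0, 1.0)


def g(x):
    return x ** 2


def h(x):
    return math.sin(x)


a, b = 2.0, -3.0
lhs = expectation(lambda x: a * g(x) + b * h(x))  # E[a g(x) + b h(x)]
rhs = a * expectation(g) + b * expectation(h)     # a E[g(x)] + b E[h(x)]

print(lhs, rhs)
```

The two numbers agree up to floating-point rounding, because the numerical integral is itself a linear operation on the integrand.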


For two random variables, we may wish to characterize their
correspondence to each other. The covariance intuitively represents the notion of
how dependent random variables are on one another.


Definition 6.5 (Covariance (Univariate)). The covariance between two
univariate random variables X, Y ∈ ℝ is given by the expected product
of their deviations from their respective means, i.e.,


\mathrm{Cov}_{X,Y}[x, y] := \mathbb{E}_{X,Y}\big[(x - \mathbb{E}_X[x])(y - \mathbb{E}_Y[y])\big]. (6.35)


Remark. When the random variable associated with the expectation or
covariance is clear by its arguments, the subscript is often suppressed (for
example, E_X[x] is often written as E[x]). ♢


By using the linearity of expectations, the expression in Definition 6.5 
can be rewritten as the expected value of the product minus the product 
of the expected values, i.e., 


\mathrm{Cov}[x, y] = \mathbb{E}[xy] - \mathbb{E}[x]\,\mathbb{E}[y]. (6.36)


The covariance of a variable with itself Cov[x, x] is called the variance and
is denoted by V_X[x]. The square root of the variance is called the standard
deviation and is often denoted by σ(x). The notion of covariance can be
generalized to multivariate random variables. 
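The equivalence of (6.35) and (6.36) can be checked on a small joint distribution. The joint pmf below is invented for illustration; the helper E computes an expectation under this joint distribution.

```python
import math

# Hypothetical joint pmf p(x, y) over pairs of binary states.
joint = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.4}


def E(func):
    """Expected value of func(x, y) under the joint pmf."""
    return sum(func(x, y) * p for (x, y), p in joint.items())


mean_x = E(lambda x, y: x)
mean_y = E(lambda x, y: y)

# Definition (6.35): expected product of the deviations from the means.
cov_definition = E(lambda x, y: (x - mean_x) * (y - mean_y))

# Identity (6.36): expected product minus product of expectations.
cov_identity = E(lambda x, y: x * y) - mean_x * mean_y

# Variance as the covariance of x with itself, and the standard deviation.
var_x = E(lambda x, y: (x - mean_x) ** 2)
std_x = math.sqrt(var_x)

print(cov_definition, cov_identity, var_x, std_x)
```

Both expressions for the covariance give approximately 0.15 for this table, and the variance of x is 0.25, so its standard deviation is 0.5.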


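Definition 6.6 (Covariance (Multivariate)). If we consider two
multivariate random variables X and Y with states 𝒙 ∈ ℝ^D and 𝒚 ∈ ℝ^E
respectively, the covariance between X and Y is defined as


\mathrm{Cov}[\boldsymbol{x}, \boldsymbol{y}] = \mathbb{E}[\boldsymbol{x}\boldsymbol{y}^\top] - \mathbb{E}[\boldsymbol{x}]\,\mathbb{E}[\boldsymbol{y}]^\top = \mathrm{Cov}[\boldsymbol{y}, \boldsymbol{x}]^\top \in \mathbb{R}^{D \times E}. (6.37)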


Definition 6.6 can be applied with the same multivariate random vari- 
able in both arguments, which results in a useful concept that intuitively 
captures the “spread” of a random variable. For a multivariate random 
variable, the variance describes the relation between individual dimen- 
sions of the random variable. 


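Definition 6.7 (Variance). The variance of a random variable X with
states 𝒙 ∈ ℝ^D and a mean vector 𝝁 ∈ ℝ^D is defined as


\mathbb{V}_X[\boldsymbol{x}] = \mathrm{Cov}_X[\boldsymbol{x}, \boldsymbol{x}] (6.38a)
  = \mathbb{E}_X[(\boldsymbol{x} - \boldsymbol{\mu})(\boldsymbol{x} - \boldsymbol{\mu})^\top] = \mathbb{E}_X[\boldsymbol{x}\boldsymbol{x}^\top] - \mathbb{E}_X[\boldsymbol{x}]\,\mathbb{E}_X[\boldsymbol{x}]^\top (6.38b)
  = \begin{bmatrix}
      \mathrm{Cov}[x_1, x_1] & \mathrm{Cov}[x_1, x_2] & \cdots & \mathrm{Cov}[x_1, x_D] \\
      \mathrm{Cov}[x_2, x_1] & \mathrm{Cov}[x_2, x_2] & \cdots & \mathrm{Cov}[x_2, x_D] \\
      \vdots & \vdots & \ddots & \vdots \\
      \mathrm{Cov}[x_D, x_1] & \cdots & \cdots & \mathrm{Cov}[x_D, x_D]
    \end{bmatrix}. (6.38c)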

The D × D matrix in (6.38c) is called the covariance matrix of the
multivariate random variable X. The covariance matrix is symmetric and
positive semidefinite and tells us something about the spread of the data. On
its diagonal, the covariance matrix contains the variances of the marginals


p(x_i) = \int p(x_1, \ldots, x_D)\, dx_{\setminus i}, (6.39)


where “\i” denotes “all variables but i”. The off-diagonal entries are the
cross-covariance terms Cov[x_i, x_j] for i, j = 1, ..., D, i ≠ j.

[Figure 6.5: (a) x and y are negatively correlated; (b) x and y are positively correlated.]

Remark. In this book, we generally assume that covariance matrices are 
positive definite to enable better intuition. We therefore do not discuss 
corner cases that result in positive semidefinite (low-rank) covariance
matrices. ♢
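The structure of the covariance matrix in (6.38c) can be inspected on a small discrete distribution over two-dimensional states. The pmf below is hypothetical, and the check of positive semidefiniteness evaluates the quadratic form z^⊤ Σ z on a handful of vectors only, which is a spot check rather than a proof.

```python
# Hypothetical pmf over two-dimensional states (x_1, x_2).
pmf = {(1.0, 2.0): 0.2, (2.0, 1.0): 0.5, (4.0, 3.0): 0.3}


def mean_vector(pmf):
    """Mean (6.31), computed dimension by dimension."""
    D = len(next(iter(pmf)))
    return [sum(p * x[d] for x, p in pmf.items()) for d in range(D)]


def covariance_matrix(pmf):
    """Covariance matrix (6.38b): E[(x - mu)(x - mu)^T]."""
    mu = mean_vector(pmf)
    D = len(mu)
    return [
        [sum(p * (x[i] - mu[i]) * (x[j] - mu[j]) for x, p in pmf.items())
         for j in range(D)]
        for i in range(D)
    ]


def quadratic_form(S, z):
    """Compute z^T S z for a square matrix S and a vector z."""
    n = len(z)
    return sum(z[i] * S[i][j] * z[j] for i in range(n) for j in range(n))


Sigma = covariance_matrix(pmf)
is_symmetric = all(
    abs(Sigma[i][j] - Sigma[j][i]) < 1e-12 for i in range(2) for j in range(2)
)
test_vectors = [(1.0, 0.0), (0.0, 1.0), (1.0, -1.0), (-2.0, 3.0)]
quad_values = [quadratic_form(Sigma, z) for z in test_vectors]

print("Sigma =", Sigma)
print("symmetric:", is_symmetric)
print("z^T Sigma z:", quad_values)
```

The diagonal entries of Sigma are the variances of the two marginals, and the off-diagonal entries are the cross-covariances. The quadratic form evaluated at z is the variance of the scalar projection z^⊤𝒙, which is why it cannot be negative.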


When we want to compare the covariances between different pairs of 
random variables, it turns out that the variance of each random variable 
affects the value of the covariance. The normalized version of covariance 
is called the correlation. 


Definition 6.8 (Correlation). The correlation between two random variables
$X, Y$ is given by


$$\mathrm{corr}[x, y] = \frac{\mathrm{Cov}[x, y]}{\sqrt{\mathbb{V}[x]\,\mathbb{V}[y]}} \in [-1, 1]. \tag{6.40}$$


The correlation matrix is the covariance matrix of standardized random
variables, $x/\sigma(x)$. In other words, each random variable is divided by its
standard deviation (the square root of the variance) in the correlation
matrix.

The covariance (and correlation) indicate how two random variables
are related; see Figure 6.5. Positive correlation $\mathrm{corr}[x, y]$ means that when
$x$ grows, then $y$ is also expected to grow. Negative correlation means that
as $x$ increases, then $y$ decreases.


6.4.2 Empirical Means and Covariances 


The definitions in Section 6.4.1 are often also called the population mean
and covariance, as they refer to the true statistics for the population. In
machine learning, we need to learn from empirical observations of data.
Consider a random variable $X$. There are two conceptual steps to go from
population statistics to the realization of empirical statistics. First, we use
the fact that we have a finite dataset (of size $N$) to construct an empirical
statistic that is a function of a finite number of identical random variables,
$X_1, \ldots, X_N$. Second, we observe the data, that is, we look at the
realization $x_1, \ldots, x_N$ of each of the random variables and apply the
empirical statistic.


Specifically, for the mean (Definition 6.4), given a particular dataset we 
can obtain an estimate of the mean, which is called the empirical mean or 
sample mean. The same holds for the empirical covariance. 


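Definition 6.9 (Empirical Mean and Covariance). The empirical mean vector
is the arithmetic average of the observations for each variable, and it
is defined as


$$\bar{\boldsymbol{x}} := \frac{1}{N} \sum_{n=1}^{N} \boldsymbol{x}_n, \tag{6.41}$$

where $\boldsymbol{x}_n \in \mathbb{R}^D$.


Similar to the empirical mean, the empirical covariance matrix is a $D \times D$
matrix


$$\boldsymbol{\Sigma} := \frac{1}{N} \sum_{n=1}^{N} (\boldsymbol{x}_n - \bar{\boldsymbol{x}})(\boldsymbol{x}_n - \bar{\boldsymbol{x}})^\top. \tag{6.42}$$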


To compute the statistics for a particular dataset, we would use the
realizations (observations) $\boldsymbol{x}_1, \ldots, \boldsymbol{x}_N$ and use (6.41) and (6.42).
Empirical covariance matrices are symmetric and positive semidefinite (see
Section 3.2.3).
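The empirical statistics of Definition 6.9 can be sketched in a few lines of NumPy. This is a minimal illustration, assuming the covariance is normalized by $1/N$ (rather than the unbiased $1/(N-1)$); the function name `empirical_mean_cov` and the toy dataset are hypothetical.

```python
import numpy as np

def empirical_mean_cov(X):
    """Empirical mean and covariance of the rows of X (shape N x D)."""
    X = np.asarray(X, dtype=float)
    N = X.shape[0]
    mean = X.sum(axis=0) / N              # arithmetic average per dimension
    centered = X - mean                   # subtract the mean from every row
    cov = centered.T @ centered / N       # D x D matrix, normalized by N (assumption)
    return mean, cov

# Hypothetical toy dataset: N = 4 observations in D = 2 dimensions.
X = np.array([[1.0, 2.0], [2.0, 1.0], [3.0, 5.0], [4.0, 4.0]])
mean, cov = empirical_mean_cov(X)
eigenvalues = np.linalg.eigvalsh(cov)     # nonnegative for a PSD matrix
```

The resulting matrix is symmetric by construction, and its eigenvalues are nonnegative up to rounding, which matches the positive semidefiniteness stated above.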


6.4.3 Three Expressions for the Variance 


We now focus on a single random variable $X$ and use the preceding
empirical formulas to derive three possible expressions for the variance. The
following derivation is the same for the population variance, except that
we need to take care of integrals. The standard definition of variance,
corresponding to the definition of covariance (Definition 6.5), is the
expectation of the squared deviation of a random variable $X$ from its expected
value $\mu$, i.e.,


$$\mathbb{V}_X[x] = \mathbb{E}_X[(x - \mu)^2]. \tag{6.43}$$


The expectation in (6.43) and the mean $\mu = \mathbb{E}_X[x]$ are computed using
(6.32), depending on whether $X$ is a discrete or continuous random
variable. The variance as expressed in (6.43) is the mean of a new random
variable $Z := (X - \mu)^2$.

When estimating the variance in (6.43) empirically, we need to resort
to a two-pass algorithm: one pass through the data to calculate the mean
$\mu$ using (6.41), and then a second pass using this estimate $\hat{\mu}$ to calculate the
variance. It turns out that we can avoid two passes by rearranging the
terms. The formula in (6.43) can be converted to the so-called raw-score
formula for variance:


$$\mathbb{V}_X[x] = \mathbb{E}_X[x^2] - (\mathbb{E}_X[x])^2. \tag{6.44}$$


The expression in (6.44) can be remembered as “the mean of the square
minus the square of the mean”. It can be calculated empirically in one pass
through data since we can accumulate $x_i$ (to calculate the mean) and $x_i^2$
simultaneously, where $x_i$ is the $i$th observation. Unfortunately, if
implemented in this way, it can be numerically unstable. The raw-score version
of the variance can be useful in machine learning, e.g., when deriving the
bias-variance decomposition (Bishop, 2006).
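To make the difference between the two-pass and the one-pass (raw-score) computation concrete, the following sketch implements both with plain Python floats. The function names and the data are hypothetical; the shifted dataset is chosen only to provoke the numerical instability mentioned above.

```python
def variance_two_pass(xs):
    """Two passes: first the mean, then the mean squared deviation."""
    n = len(xs)
    mu = sum(xs) / n
    return sum((x - mu) ** 2 for x in xs) / n

def variance_one_pass(xs):
    """One pass: accumulate the sum and the sum of squares (raw-score formula)."""
    n = 0
    s = 0.0    # running sum of x_i
    s2 = 0.0   # running sum of x_i^2
    for x in xs:
        n += 1
        s += x
        s2 += x * x
    mean = s / n
    return s2 / n - mean * mean

data = [4.0, 7.0, 13.0, 16.0]
shifted = [x + 1e9 for x in data]   # same spread, very large mean

small_two, small_one = variance_two_pass(data), variance_one_pass(data)
big_two, big_one = variance_two_pass(shifted), variance_one_pass(shifted)
```

Both functions agree on `data`. On `shifted`, the squares are of order $10^{18}$, so the subtraction in the one-pass version cancels almost all significant digits in double precision and the result is no longer close to the true variance, whereas the two-pass version is unaffected by the shift.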

A third way to understand the variance is that it is a sum of pairwise
differences between all pairs of observations. Consider a sample $x_1, \ldots, x_N$
of realizations of random variable $X$, and compute the squared
difference between pairs of $x_i$ and $x_j$. By expanding the square, we can show
that the normalized sum of the $N^2$ pairwise squared differences is
proportional to the empirical variance of the observations:


$$\frac{1}{N^2} \sum_{i,j=1}^{N} (x_i - x_j)^2 = 2\left[\frac{1}{N}\sum_{i=1}^{N} x_i^2 - \left(\frac{1}{N}\sum_{i=1}^{N} x_i\right)^2\right]. \tag{6.45}$$


We see that (6.45) is twice the raw-score expression (6.44). This means
that we can express the sum of pairwise distances (of which there are $N^2$)
as a sum of deviations from the mean (of which there are $N$).
Geometrically, this means that there is an equivalence between the pairwise
distances and the distances from the center of the set of points. From a
computational perspective, this means that by computing the mean ($N$
terms in the summation), and then computing the variance (again $N$
terms in the summation), we can obtain an expression (left-hand side
of (6.45)) that has $N^2$ terms.
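The identity between the averaged pairwise squared differences and twice the raw-score variance can be checked numerically. This is a small sketch on a hypothetical four-point sample; the helper names are illustrative only.

```python
def mean_pairwise_sq_diff(xs):
    """Average of (x_i - x_j)^2 over all N^2 ordered pairs."""
    n = len(xs)
    return sum((xi - xj) ** 2 for xi in xs for xj in xs) / n**2

def raw_score_variance(xs):
    """Mean of the squares minus the square of the mean."""
    n = len(xs)
    mean = sum(xs) / n
    return sum(x * x for x in xs) / n - mean * mean

xs = [4.0, 7.0, 13.0, 16.0]
lhs = mean_pairwise_sq_diff(xs)       # N^2 terms
rhs = 2 * raw_score_variance(xs)      # two sums of N terms each
```

The left-hand side needs $N^2$ terms while the right-hand side needs only two sums over $N$ terms, which is the computational saving described in the text.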


6.4.4 Sums and Transformations of Random Variables 


We may want to model a phenomenon that cannot be well explained by 
textbook distributions (we introduce some in Sections 6.5 and 6.6), and 
hence may perform simple manipulations of random variables (such as 
adding two random variables). 

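Consider two random variables $X, Y$ with states $\boldsymbol{x}, \boldsymbol{y} \in \mathbb{R}^D$. Then:


$$\begin{aligned}
\mathbb{E}[\boldsymbol{x} + \boldsymbol{y}] &= \mathbb{E}[\boldsymbol{x}] + \mathbb{E}[\boldsymbol{y}] && \text{(6.46)} \\
\mathbb{E}[\boldsymbol{x} - \boldsymbol{y}] &= \mathbb{E}[\boldsymbol{x}] - \mathbb{E}[\boldsymbol{y}] && \text{(6.47)} \\
\mathbb{V}[\boldsymbol{x} + \boldsymbol{y}] &= \mathbb{V}[\boldsymbol{x}] + \mathbb{V}[\boldsymbol{y}] + \mathrm{Cov}[\boldsymbol{x}, \boldsymbol{y}] + \mathrm{Cov}[\boldsymbol{y}, \boldsymbol{x}] && \text{(6.48)} \\
\mathbb{V}[\boldsymbol{x} - \boldsymbol{y}] &= \mathbb{V}[\boldsymbol{x}] + \mathbb{V}[\boldsymbol{y}] - \mathrm{Cov}[\boldsymbol{x}, \boldsymbol{y}] - \mathrm{Cov}[\boldsymbol{y}, \boldsymbol{x}]. && \text{(6.49)}
\end{aligned}$$

Mean and (co)variance exhibit some useful properties when it comes
to affine transformation of random variables. Consider a random variable
$X$ with mean $\boldsymbol{\mu}$ and covariance matrix $\boldsymbol{\Sigma}$ and a (deterministic) affine
transformation $\boldsymbol{y} = \boldsymbol{A}\boldsymbol{x} + \boldsymbol{b}$ of $\boldsymbol{x}$. Then $\boldsymbol{y}$ is itself a random variable
whose mean vector and covariance matrix are given by

$$\mathbb{E}_Y[\boldsymbol{y}] = \mathbb{E}_X[\boldsymbol{A}\boldsymbol{x} + \boldsymbol{b}] = \boldsymbol{A}\mathbb{E}_X[\boldsymbol{x}] + \boldsymbol{b} = \boldsymbol{A}\boldsymbol{\mu} + \boldsymbol{b}, \tag{6.50}$$

$$\mathbb{V}_Y[\boldsymbol{y}] = \mathbb{V}_X[\boldsymbol{A}\boldsymbol{x} + \boldsymbol{b}] = \mathbb{V}_X[\boldsymbol{A}\boldsymbol{x}] = \boldsymbol{A}\mathbb{V}_X[\boldsymbol{x}]\boldsymbol{A}^\top = \boldsymbol{A}\boldsymbol{\Sigma}\boldsymbol{A}^\top, \tag{6.51}$$


respectively. Furthermore,


$$\begin{aligned}
\mathrm{Cov}[\boldsymbol{x}, \boldsymbol{y}] &= \mathbb{E}[\boldsymbol{x}(\boldsymbol{A}\boldsymbol{x} + \boldsymbol{b})^\top] - \mathbb{E}[\boldsymbol{x}]\mathbb{E}[\boldsymbol{A}\boldsymbol{x} + \boldsymbol{b}]^\top && \text{(6.52a)} \\
&= \mathbb{E}[\boldsymbol{x}]\boldsymbol{b}^\top + \mathbb{E}[\boldsymbol{x}\boldsymbol{x}^\top]\boldsymbol{A}^\top - \boldsymbol{\mu}\boldsymbol{b}^\top - \boldsymbol{\mu}\boldsymbol{\mu}^\top\boldsymbol{A}^\top && \text{(6.52b)} \\
&= \boldsymbol{\mu}\boldsymbol{b}^\top - \boldsymbol{\mu}\boldsymbol{b}^\top + (\mathbb{E}[\boldsymbol{x}\boldsymbol{x}^\top] - \boldsymbol{\mu}\boldsymbol{\mu}^\top)\boldsymbol{A}^\top && \text{(6.52c)} \\
&= \boldsymbol{\Sigma}\boldsymbol{A}^\top, && \text{(6.52d)}
\end{aligned}$$


where $\boldsymbol{\Sigma} = \mathbb{E}[\boldsymbol{x}\boldsymbol{x}^\top] - \boldsymbol{\mu}\boldsymbol{\mu}^\top$ is the covariance of $X$.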
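Because the affine-transformation rules follow from linearity alone, they also hold exactly for empirical means and covariances. The sketch below checks this with NumPy on synthetic data; the matrix `A`, the offset `b`, and the dataset are hypothetical, and the covariances are normalized by $N$ (an assumption, via `bias=True`).

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))                 # rows are observations x_n in R^3
A = np.array([[1.0, 2.0, 0.0],
              [0.0, -1.0, 3.0]])              # maps R^3 to R^2
b = np.array([0.5, -2.0])
Y = X @ A.T + b                               # y_n = A x_n + b for every row

mu_x = X.mean(axis=0)
Sigma_x = np.cov(X, rowvar=False, bias=True)  # empirical covariance of x
mu_y = Y.mean(axis=0)
Sigma_y = np.cov(Y, rowvar=False, bias=True)  # empirical covariance of y

# Empirical cross-covariance between x and y (3 x 2 matrix).
Sigma_xy = (X - mu_x).T @ (Y - mu_y) / X.shape[0]

mean_ok = np.allclose(mu_y, A @ mu_x + b)             # mean rule
cov_ok = np.allclose(Sigma_y, A @ Sigma_x @ A.T)      # covariance rule
cross_ok = np.allclose(Sigma_xy, Sigma_x @ A.T)       # cross-covariance rule
```

Note that the offset `b` shifts the mean but drops out of both covariance expressions, as in the derivation above.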


6.4.5 Statistical Independence 


Definition 6.10 (Independence). Two random variables $X, Y$ are
statistically independent if and only if


$$p(\boldsymbol{x}, \boldsymbol{y}) = p(\boldsymbol{x})p(\boldsymbol{y}). \tag{6.53}$$


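Intuitively, two random variables $X$ and $Y$ are independent if the value
of $\boldsymbol{y}$ (once known) does not add any additional information about $\boldsymbol{x}$ (and
vice versa). If $X, Y$ are (statistically) independent, then


- $p(\boldsymbol{y} \mid \boldsymbol{x}) = p(\boldsymbol{y})$
- $p(\boldsymbol{x} \mid \boldsymbol{y}) = p(\boldsymbol{x})$
- $\mathbb{V}_{X,Y}[\boldsymbol{x} + \boldsymbol{y}] = \mathbb{V}_X[\boldsymbol{x}] + \mathbb{V}_Y[\boldsymbol{y}]$
- $\mathrm{Cov}_{X,Y}[\boldsymbol{x}, \boldsymbol{y}] = \boldsymbol{0}$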


The last point may not hold in converse, i.e., two random variables can 
have covariance zero but are not statistically independent. To understand 
why, recall that covariance measures only linear dependence. Therefore, 
random variables that are nonlinearly dependent could have covariance 
zero. 


Example 6.5

Consider a random variable $X$ with zero mean ($\mathbb{E}_X[x] = 0$) and also
$\mathbb{E}_X[x^3] = 0$. Let $y = x^2$ (hence, $Y$ is dependent on $X$) and consider the
covariance (6.36) between $X$ and $Y$. But this gives


$$\mathrm{Cov}[x, y] = \mathbb{E}[xy] - \mathbb{E}[x]\mathbb{E}[y] = \mathbb{E}[x^3] = 0. \tag{6.54}$$
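A concrete instance of this example can be worked out exactly with a small discrete distribution. The sketch below assumes, purely for illustration, that $X$ is uniform on $\{-1, 0, 1\}$ (so its mean and third moment are both zero) and sets $y = x^2$; exact rational arithmetic avoids any rounding.

```python
from fractions import Fraction

# Hypothetical distribution: X uniform on {-1, 0, 1}, and Y = X^2.
p_x = {x: Fraction(1, 3) for x in (-1, 0, 1)}

E_x = sum(p * x for x, p in p_x.items())            # E[x]
E_y = sum(p * x**2 for x, p in p_x.items())         # E[y] = E[x^2]
E_xy = sum(p * x * x**2 for x, p in p_x.items())    # E[xy] = E[x^3]
cov_xy = E_xy - E_x * E_y

# Independence would require P(X=0, Y=0) = P(X=0) P(Y=0).
p_joint_00 = p_x[0]          # Y = 0 exactly when X = 0
p_y0 = p_x[0]
independent = (p_joint_00 == p_x[0] * p_y0)
```

Here `cov_xy` is exactly zero, yet the joint probability $1/3$ differs from the product $1/9$, so the two variables are uncorrelated but not independent.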
In machine learning, we often consider problems that can be modeled
as independent and identically distributed (i.i.d.) random variables,
$X_1, \ldots, X_N$. For more than two random variables, the word “independent”
(Definition 6.10) usually refers to mutually independent random
variables, where all subsets are independent (see Pollard (2002, chapter 4)
and Jacod and Protter (2004, chapter 3)). The phrase “identically
distributed” means that all the random variables are from the same
distribution.

Another concept that is important in machine learning is conditional 
independence. 


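Definition 6.11 (Conditional Independence). Two random variables $X$
and $Y$ are conditionally independent given $Z$ if and only if


$$p(\boldsymbol{x}, \boldsymbol{y} \mid \boldsymbol{z}) = p(\boldsymbol{x} \mid \boldsymbol{z})\,p(\boldsymbol{y} \mid \boldsymbol{z}) \quad \text{for all } \boldsymbol{z} \in \mathcal{Z}, \tag{6.55}$$


where $\mathcal{Z}$ is the set of states of random variable $Z$. We write $X \perp\!\!\!\perp Y \mid Z$ to
denote that $X$ is conditionally independent of $Y$ given $Z$.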


Definition 6.11 requires that the relation in (6.55) must hold true for
every value of $\boldsymbol{z}$. The interpretation of (6.55) can be understood as “given
knowledge about $\boldsymbol{z}$, the distribution of $\boldsymbol{x}$ and $\boldsymbol{y}$ factorizes”. Independence
can be cast as a special case of conditional independence if we write
$X \perp\!\!\!\perp Y \mid \emptyset$. By using the product rule of probability (6.22), we can expand the
left-hand side of (6.55) to obtain


$$p(\boldsymbol{x}, \boldsymbol{y} \mid \boldsymbol{z}) = p(\boldsymbol{x} \mid \boldsymbol{y}, \boldsymbol{z})\,p(\boldsymbol{y} \mid \boldsymbol{z}). \tag{6.56}$$

By comparing the right-hand side of (6.55) with (6.56), we see that $p(\boldsymbol{y} \mid \boldsymbol{z})$
appears in both of them so that

$$p(\boldsymbol{x} \mid \boldsymbol{y}, \boldsymbol{z}) = p(\boldsymbol{x} \mid \boldsymbol{z}). \tag{6.57}$$


Equation (6.57) provides an alternative definition of conditional
independence, i.e., $X \perp\!\!\!\perp Y \mid Z$. This alternative presentation provides the
interpretation “given that we know $\boldsymbol{z}$, knowledge about $\boldsymbol{y}$ does not change our
knowledge of $\boldsymbol{x}$”.
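For discrete variables, conditional independence can be verified directly on a probability table. The sketch below builds a joint distribution over three binary variables that factorizes given $z$ by construction (all probability values are hypothetical) and then checks the alternative characterization, namely that $p(x \mid y, z)$ does not depend on $y$.

```python
import numpy as np

# Hypothetical conditional probability tables for binary x, y, z.
p_z = np.array([0.3, 0.7])
p_x_given_z = np.array([[0.9, 0.1],      # row z, column x
                        [0.2, 0.8]])
p_y_given_z = np.array([[0.6, 0.4],      # row z, column y
                        [0.1, 0.9]])

# joint[x, y, z] = p(x | z) p(y | z) p(z)
joint = np.einsum('zx,zy,z->xyz', p_x_given_z, p_y_given_z, p_z)

p_yz = joint.sum(axis=0)                 # p(y, z), indexed [y, z]
p_x_given_yz = joint / p_yz              # p(x | y, z), indexed [x, y, z]

# Conditional independence: p(x | y, z) equals p(x | z) for every y.
cond_indep = np.allclose(p_x_given_yz, p_x_given_z.T[:, None, :])

# Marginally, x and y are still dependent in this example.
p_xy = joint.sum(axis=2)
marg_indep = np.allclose(p_xy, np.outer(p_xy.sum(axis=1), p_xy.sum(axis=0)))
```

In this table `cond_indep` is true while `marg_indep` is false, which illustrates that conditional independence given $z$ does not imply (marginal) independence.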


6.4.6 Inner Products of Random Variables 


Recall the definition of inner products from Section 3.2. We can define an 
inner product between random variables, which we briefly describe in this 
section. If we have two uncorrelated random variables X, Y, then 


$$\mathbb{V}[x + y] = \mathbb{V}[x] + \mathbb{V}[y]. \tag{6.58}$$


Since variances are measured in squared units, this looks very much like
the Pythagorean theorem for right triangles $c^2 = a^2 + b^2$.

In the following, we see whether we can find a geometric interpretation
of the variance relation of uncorrelated random variables in (6.58).
Random variables can be considered vectors in a vector space, and we
can define inner products to obtain geometric properties of random
variables (Eaton, 2007). If we define


$$\langle X, Y \rangle := \mathrm{Cov}[x, y] \tag{6.59}$$


for zero mean random variables $X$ and $Y$, we obtain an inner product. We
see that the covariance is symmetric, positive definite, and linear in either
argument. The length of a random variable is


$$\|X\| = \sqrt{\mathrm{Cov}[x, x]} = \sqrt{\mathbb{V}[x]} = \sigma[x], \tag{6.60}$$


i.e., its standard deviation. The “longer” the random variable, the more
uncertain it is; and a random variable with length 0 is deterministic.
If we look at the angle $\theta$ between two random variables $X, Y$, we get


$$\cos\theta = \frac{\langle X, Y \rangle}{\|X\|\,\|Y\|} = \frac{\mathrm{Cov}[x, y]}{\sqrt{\mathbb{V}[x]\,\mathbb{V}[y]}}, \tag{6.61}$$


which is the correlation (Definition 6.8) between the two random
variables. This means that we can think of correlation as the cosine of the
angle between two random variables when we consider them geometrically.
We know from Definition 3.7 that $X \perp Y \iff \langle X, Y \rangle = 0$. In our
case, this means that $X$ and $Y$ are orthogonal if and only if $\mathrm{Cov}[x, y] = 0$,
i.e., they are uncorrelated. Figure 6.6 illustrates this relationship.
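The geometric reading has a direct empirical analogue: if we center two samples and treat them as vectors, the empirical covariance acts as the inner product and the cosine of the angle between the vectors is the empirical correlation. The sketch below uses hypothetical data and helper names.

```python
import math

def centered(v):
    """Subtract the sample mean so the sample has zero mean."""
    m = sum(v) / len(v)
    return [a - m for a in v]

def inner(u, v):
    """Empirical covariance of two zero-mean samples, used as inner product."""
    return sum(a * b for a, b in zip(u, v)) / len(u)

x = centered([1.0, 2.0, 3.0, 4.0])
y = centered([2.0, 1.0, 5.0, 4.0])
z = centered([1.0, -1.0, -1.0, 1.0])

length_x = math.sqrt(inner(x, x))        # empirical standard deviation of x
length_y = math.sqrt(inner(y, y))        # empirical standard deviation of y
cos_theta = inner(x, y) / (length_x * length_y)   # empirical correlation

orthogonal_xz = inner(x, z) == 0.0       # zero covariance: x and z are orthogonal
```

For this data the cosine is $1/\sqrt{2}$, i.e., an angle of 45 degrees between `x` and `y`, while `x` and `z` are orthogonal and hence uncorrelated.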


Remark. While it is tempting to use the Euclidean distance (constructed
from the preceding definition of inner products) to compare probability
distributions, it is unfortunately not the best way to obtain distances
between distributions. Recall that the probability mass (or density) is
positive and needs to add up to 1. These constraints mean that distributions
live on something called a statistical manifold. The study of this space of
probability distributions is called information geometry. Computing
distances between distributions is often done using the Kullback-Leibler
divergence, which is a generalization of distances that accounts for properties of
the statistical manifold. Just like the Euclidean distance is a special case of
a metric (Section 3.3), the Kullback-Leibler divergence is a special case of
two more general classes of divergences called Bregman divergences and
f-divergences. The study of divergences is beyond the scope of this book,
and we refer for more details to the recent book by Amari (2016), one of
the founders of the field of information geometry. ♢


6.5 Gaussian Distribution 


The Gaussian distribution is the most well-studied probability distribution 
for continuous-valued random variables. It is also referred to as the normal 
distribution. Its importance originates from the fact that it has many com- 
putationally convenient properties, which we will be discussing in the fol- 
lowing. In particular, we will use it to define the likelihood and prior for 
linear regression (Chapter 9), and consider a mixture of Gaussians for 
density estimation (Chapter 11). 

There are many other areas of machine learning that also benefit from
using a Gaussian distribution, for example Gaussian processes, variational
inference, and reinforcement learning. It is also widely used in other
application areas such as signal processing (e.g., Kalman filter), control (e.g.,
linear quadratic regulator), and statistics (e.g., hypothesis testing).

[Figure panels: (a) Univariate (one-dimensional) Gaussian; the red cross shows the mean and the red line shows the extent of the variance. (b) Multivariate (two-dimensional) Gaussian, viewed from top; the red cross shows the mean and the colored lines show the contour lines of the density.]


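For a univariate random variable, the Gaussian distribution has a density
that is given by


$$p(x \mid \mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(x - \mu)^2}{2\sigma^2}\right). \tag{6.62}$$


The multivariate Gaussian distribution is fully characterized by a mean
vector $\boldsymbol{\mu}$ and a covariance matrix $\boldsymbol{\Sigma}$ and defined as


$$p(\boldsymbol{x} \mid \boldsymbol{\mu}, \boldsymbol{\Sigma}) = (2\pi)^{-\frac{D}{2}} |\boldsymbol{\Sigma}|^{-\frac{1}{2}} \exp\left(-\tfrac{1}{2}(\boldsymbol{x} - \boldsymbol{\mu})^\top \boldsymbol{\Sigma}^{-1} (\boldsymbol{x} - \boldsymbol{\mu})\right), \tag{6.63}$$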


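where $\boldsymbol{x} \in \mathbb{R}^D$. We write $p(\boldsymbol{x}) = \mathcal{N}(\boldsymbol{x} \mid \boldsymbol{\mu}, \boldsymbol{\Sigma})$ or $X \sim \mathcal{N}(\boldsymbol{\mu}, \boldsymbol{\Sigma})$.
Figure 6.7 shows a bivariate Gaussian (mesh), with the corresponding
contour plot. Figure 6.8 shows a univariate Gaussian and a bivariate Gaussian
with corresponding samples. The special case of the Gaussian with zero
mean and identity covariance, that is, $\boldsymbol{\mu} = \boldsymbol{0}$ and $\boldsymbol{\Sigma} = \boldsymbol{I}$, is referred to as
the standard normal distribution.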
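The multivariate density can be evaluated numerically without forming the inverse covariance explicitly. The following is a minimal sketch, not a reference implementation: `gaussian_pdf` is a hypothetical helper that solves a linear system for the quadratic form and uses the log-determinant for stability.

```python
import numpy as np

def gaussian_pdf(x, mu, Sigma):
    """Density of N(mu, Sigma) at x; scalars are treated as the case D = 1."""
    x = np.atleast_1d(x).astype(float)
    mu = np.atleast_1d(mu).astype(float)
    Sigma = np.atleast_2d(Sigma).astype(float)
    D = x.shape[0]
    diff = x - mu
    # Quadratic form (x - mu)^T Sigma^{-1} (x - mu) via a linear solve.
    maha = diff @ np.linalg.solve(Sigma, diff)
    _, logdet = np.linalg.slogdet(Sigma)          # log |Sigma|
    log_density = -0.5 * (D * np.log(2.0 * np.pi) + logdet + maha)
    return float(np.exp(log_density))

p_uni = gaussian_pdf(0.0, 0.0, 1.0)                       # univariate standard normal at 0
p_std = gaussian_pdf([0.0, 0.0], [0.0, 0.0], np.eye(2))   # bivariate standard normal at 0
p_diag = gaussian_pdf([1.0, 2.0], [1.0, 2.0], np.diag([4.0, 9.0]))
```

With $D = 1$ the function reduces to the univariate density, so `p_uni` equals $1/\sqrt{2\pi}$, and the bivariate standard normal at the origin gives $1/(2\pi)$.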

Gaussians are widely used in statistical estimation and machine learn- 
ing as they have closed-form expressions for marginal and conditional dis- 
tributions. In Chapter 9, we use these closed-form expressions extensively 
for linear regression. A major advantage of modeling with Gaussian ran- 
dom variables is that variable transformations (Section 6.7) are often not 
needed. Since the Gaussian distribution is fully specified by its mean and 
covariance, we often can obtain the transformed distribution by applying 
the transformation to the mean and covariance of the random variable. 


6.5.1 Marginals and Conditionals of Gaussians are Gaussians 


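In the following, we present marginalization and conditioning in the
general case of multivariate random variables. If this is confusing at first
reading, the reader is advised to consider two univariate random variables
instead. Let $X$ and $Y$ be two multivariate random variables, that may have
different dimensions. To consider the effect of applying the sum rule of
probability and the effect of conditioning, we explicitly write the Gaussian
distribution in terms of the concatenated states $[\boldsymbol{x}^\top, \boldsymbol{y}^\top]$,


$$p(\boldsymbol{x}, \boldsymbol{y}) = \mathcal{N}\left(\begin{bmatrix} \boldsymbol{\mu}_x \\ \boldsymbol{\mu}_y \end{bmatrix}, \begin{bmatrix} \boldsymbol{\Sigma}_{xx} & \boldsymbol{\Sigma}_{xy} \\ \boldsymbol{\Sigma}_{yx} & \boldsymbol{\Sigma}_{yy} \end{bmatrix}\right), \tag{6.64}$$


where $\boldsymbol{\Sigma}_{xx} = \mathrm{Cov}[\boldsymbol{x}, \boldsymbol{x}]$ and $\boldsymbol{\Sigma}_{yy} = \mathrm{Cov}[\boldsymbol{y}, \boldsymbol{y}]$ are the marginal covariance
matrices of $\boldsymbol{x}$ and $\boldsymbol{y}$, respectively, and $\boldsymbol{\Sigma}_{xy} = \mathrm{Cov}[\boldsymbol{x}, \boldsymbol{y}]$ is the
cross-covariance matrix between $\boldsymbol{x}$ and $\boldsymbol{y}$.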

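The conditional distribution $p(\boldsymbol{x} \mid \boldsymbol{y})$ is also Gaussian (illustrated in
Figure 6.9(c)) and given by (derived in Section 2.3 of Bishop, 2006)


$$\begin{aligned}
p(\boldsymbol{x} \mid \boldsymbol{y}) &= \mathcal{N}(\boldsymbol{\mu}_{x|y}, \boldsymbol{\Sigma}_{x|y}) && \text{(6.65)} \\
\boldsymbol{\mu}_{x|y} &= \boldsymbol{\mu}_x + \boldsymbol{\Sigma}_{xy}\boldsymbol{\Sigma}_{yy}^{-1}(\boldsymbol{y} - \boldsymbol{\mu}_y) && \text{(6.66)} \\
\boldsymbol{\Sigma}_{x|y} &= \boldsymbol{\Sigma}_{xx} - \boldsymbol{\Sigma}_{xy}\boldsymbol{\Sigma}_{yy}^{-1}\boldsymbol{\Sigma}_{yx}. && \text{(6.67)}
\end{aligned}$$


Note that in the computation of the mean in (6.66), the $\boldsymbol{y}$-value is an
observation and no longer random.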


Remark. The conditional Gaussian distribution shows up in many places,
where we are interested in posterior distributions:


- The Kalman filter (Kalman, 1960), one of the most central algorithms
  for state estimation in signal processing, does nothing but compute
  Gaussian conditionals of joint distributions (Deisenroth and Ohlsson,
  2011; Särkkä, 2013).

- Gaussian processes (Rasmussen and Williams, 2006), which are a
  practical implementation of a distribution over functions. In a Gaussian
  process, we make assumptions of joint Gaussianity of random variables. By
  (Gaussian) conditioning on observed data, we can determine a
  posterior distribution over functions.

- Latent linear Gaussian models (Roweis and Ghahramani, 1999;
  Murphy, 2012), which include probabilistic principal component analysis
  (PPCA) (Tipping and Bishop, 1999). We will look at PPCA in more
  detail in Section 10.7.

♢


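The marginal distribution $p(\boldsymbol{x})$ of a joint Gaussian distribution $p(\boldsymbol{x}, \boldsymbol{y})$
(see (6.64)) is itself Gaussian and computed by applying the sum rule
(6.20) and given by


$$p(\boldsymbol{x}) = \int p(\boldsymbol{x}, \boldsymbol{y})\,\mathrm{d}\boldsymbol{y} = \mathcal{N}(\boldsymbol{x} \mid \boldsymbol{\mu}_x, \boldsymbol{\Sigma}_{xx}). \tag{6.68}$$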


The corresponding result holds for $p(\boldsymbol{y})$, which is obtained by marginalizing
with respect to $\boldsymbol{x}$. Intuitively, looking at the joint distribution in (6.64),
we ignore (i.e., integrate out) everything we are not interested in. This is
illustrated in Figure 6.9(b).


Example 6.6

Consider the bivariate Gaussian distribution (illustrated in Figure 6.9):


$$p(x_1, x_2) = \mathcal{N}\left(\begin{bmatrix} 0 \\ 2 \end{bmatrix}, \begin{bmatrix} 0.3 & -1 \\ -1 & 5 \end{bmatrix}\right). \tag{6.69}$$


We can compute the parameters of the univariate Gaussian, conditioned
on $x_2 = -1$, by applying (6.66) and (6.67) to obtain the mean and
variance respectively. Numerically, this is


$$\mu_{x_1 \mid x_2 = -1} = 0 + (-1) \cdot 0.2 \cdot (-1 - 2) = 0.6 \tag{6.70}$$

and

$$\sigma^2_{x_1 \mid x_2 = -1} = 0.3 - (-1) \cdot 0.2 \cdot (-1) = 0.1. \tag{6.71}$$


Therefore, the conditional Gaussian is given by

$$p(x_1 \mid x_2 = -1) = \mathcal{N}(0.6, 0.1). \tag{6.72}$$


The marginal distribution $p(x_1)$, in contrast, can be obtained by applying
(6.68), which is essentially using the mean and variance of the random
variable $x_1$, giving us


$$p(x_1) = \mathcal{N}(0, 0.3). \tag{6.73}$$
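The numbers in this example can be reproduced with a few lines of arithmetic. The sketch below assumes the joint parameters implied by the computations above (mean $(0, 2)$, variances $0.3$ and $5$, cross-covariance $-1$); the variable names are illustrative.

```python
# Joint Gaussian parameters (assumed, read off from the worked numbers above).
mu1, mu2 = 0.0, 2.0
s11, s12, s22 = 0.3, -1.0, 5.0     # Var[x1], Cov[x1, x2], Var[x2]

x2_observed = -1.0

# Conditioning: mean and variance of x1 given x2 = -1.
cond_mean = mu1 + s12 / s22 * (x2_observed - mu2)
cond_var = s11 - s12 / s22 * s12

# Marginalization: simply read off the x1 block of the joint parameters.
marg_mean, marg_var = mu1, s11
```

Conditioning on $x_2$ reduces the variance of $x_1$ from $0.3$ to about $0.1$ and shifts the mean to about $0.6$, whereas the marginal keeps the original mean and variance of $x_1$.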
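6.5.2 Product of Gaussian Densities


For linear regression (Chapter 9), we need to compute a Gaussian likelihood.
Furthermore, we may wish to assume a Gaussian prior (Section 9.3).
We apply Bayes’ Theorem to compute the posterior, which results in a
multiplication of the likelihood and the prior, that is, the multiplication of two
Gaussian densities. The product of two Gaussians $\mathcal{N}(\boldsymbol{x} \mid \boldsymbol{a}, \boldsymbol{A})\,\mathcal{N}(\boldsymbol{x} \mid \boldsymbol{b}, \boldsymbol{B})$
is a Gaussian distribution scaled by a $c \in \mathbb{R}$, given by $c\,\mathcal{N}(\boldsymbol{x} \mid \boldsymbol{c}, \boldsymbol{C})$ with


$$\begin{aligned}
\boldsymbol{C} &= (\boldsymbol{A}^{-1} + \boldsymbol{B}^{-1})^{-1} && \text{(6.74)} \\
\boldsymbol{c} &= \boldsymbol{C}(\boldsymbol{A}^{-1}\boldsymbol{a} + \boldsymbol{B}^{-1}\boldsymbol{b}) && \text{(6.75)} \\
c &= (2\pi)^{-\frac{D}{2}} |\boldsymbol{A} + \boldsymbol{B}|^{-\frac{1}{2}} \exp\left(-\tfrac{1}{2}(\boldsymbol{a} - \boldsymbol{b})^\top (\boldsymbol{A} + \boldsymbol{B})^{-1} (\boldsymbol{a} - \boldsymbol{b})\right). && \text{(6.76)}
\end{aligned}$$


The scaling constant $c$ itself can be written in the form of a Gaussian
density either in $\boldsymbol{a}$ or in $\boldsymbol{b}$ with an “inflated” covariance matrix $\boldsymbol{A} + \boldsymbol{B}$,
i.e., $c = \mathcal{N}(\boldsymbol{a} \mid \boldsymbol{b}, \boldsymbol{A} + \boldsymbol{B}) = \mathcal{N}(\boldsymbol{b} \mid \boldsymbol{a}, \boldsymbol{A} + \boldsymbol{B})$.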

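Remark. For notation convenience, we will sometimes use $\mathcal{N}(\boldsymbol{x} \mid \boldsymbol{m}, \boldsymbol{S})$
to describe the functional form of a Gaussian density even if $\boldsymbol{x}$ is not a
random variable. We have just done this in the preceding demonstration
when we wrote


$$c = \mathcal{N}(\boldsymbol{a} \mid \boldsymbol{b}, \boldsymbol{A} + \boldsymbol{B}) = \mathcal{N}(\boldsymbol{b} \mid \boldsymbol{a}, \boldsymbol{A} + \boldsymbol{B}). \tag{6.77}$$


Here, neither $\boldsymbol{a}$ nor $\boldsymbol{b}$ are random variables. However, writing $c$ in this way
is more compact than (6.76). ♢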
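In the univariate case the product formulas reduce to scalar arithmetic, which makes them easy to check pointwise. The sketch below uses hypothetical parameter values; `gaussian_product` returns the scaling constant together with the mean and variance of the resulting Gaussian.

```python
import math

def normal_pdf(x, mean, var):
    """Univariate Gaussian density with the given mean and variance."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)

def gaussian_product(a, A, b, B):
    """Parameters of N(x|a,A) N(x|b,B) = scale * N(x|c_mean, C) in one dimension."""
    C = 1.0 / (1.0 / A + 1.0 / B)          # combined variance
    c_mean = C * (a / A + b / B)           # precision-weighted mean
    scale = normal_pdf(a, b, A + B)        # scaling constant as an "inflated" Gaussian
    return scale, c_mean, C

a, A, b, B = 1.0, 2.0, -0.5, 0.5           # hypothetical means and variances
scale, c_mean, C = gaussian_product(a, A, b, B)

x = 0.3                                    # arbitrary evaluation point
lhs = normal_pdf(x, a, A) * normal_pdf(x, b, B)
rhs = scale * normal_pdf(x, c_mean, C)
```

The combined variance `C` is smaller than both input variances, and the combined mean lies between `a` and `b`, closer to the mean of the more certain factor.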


6.5.3 Sums and Linear Transformations 


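If $X, Y$ are independent Gaussian random variables (i.e., the joint
distribution is given as $p(\boldsymbol{x}, \boldsymbol{y}) = p(\boldsymbol{x})p(\boldsymbol{y})$) with $p(\boldsymbol{x}) = \mathcal{N}(\boldsymbol{x} \mid \boldsymbol{\mu}_x, \boldsymbol{\Sigma}_x)$ and
$p(\boldsymbol{y}) = \mathcal{N}(\boldsymbol{y} \mid \boldsymbol{\mu}_y, \boldsymbol{\Sigma}_y)$, then $\boldsymbol{x} + \boldsymbol{y}$ is also Gaussian distributed and given
by

$$p(\boldsymbol{x} + \boldsymbol{y}) = \mathcal{N}(\boldsymbol{\mu}_x + \boldsymbol{\mu}_y, \boldsymbol{\Sigma}_x + \boldsymbol{\Sigma}_y). \tag{6.78}$$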


Knowing that p(x + y) is Gaussian, the mean and covariance matrix can 
be determined immediately using the results from (6.46) through (6.49). 
This property will be important when we consider i.i.d. Gaussian noise 
acting on random variables, as is the case for linear regression (Chap- 
ter 9). 


Example 6.7
Since expectations are linear operations, we can obtain the weighted sum
of independent Gaussian random variables


$$p(a\boldsymbol{x} + b\boldsymbol{y}) = \mathcal{N}(a\boldsymbol{\mu}_x + b\boldsymbol{\mu}_y, a^2\boldsymbol{\Sigma}_x + b^2\boldsymbol{\Sigma}_y). \tag{6.79}$$

Remark. A case that will be useful in Chapter 11 is the weighted sum of
Gaussian densities. This is different from the weighted sum of Gaussian
random variables. ♢

In Theorem 6.12, the random variable $x$ is from a density that is a
mixture of two densities $p_1(x)$ and $p_2(x)$, weighted by $\alpha$. The theorem can
be generalized to the multivariate random variable case, since linearity of
expectations holds also for multivariate random variables. However, the
idea of a squared random variable needs to be replaced by $\boldsymbol{x}\boldsymbol{x}^\top$.


Theorem 6.12. Consider a mixture of two univariate Gaussian densities

$$p(x) = \alpha p_1(x) + (1 - \alpha) p_2(x), \tag{6.80}$$


where the scalar $0 < \alpha < 1$ is the mixture weight, and $p_1(x)$ and $p_2(x)$ are
univariate Gaussian densities (Equation (6.62)) with different parameters,
i.e., $(\mu_1, \sigma_1^2) \neq (\mu_2, \sigma_2^2)$.

Then the mean of the mixture density $p(x)$ is given by the weighted sum
of the means of each random variable:


$$\mathbb{E}[x] = \alpha\mu_1 + (1 - \alpha)\mu_2. \tag{6.81}$$

The variance of the mixture density $p(x)$ is given by

$$\mathbb{V}[x] = \left[\alpha\sigma_1^2 + (1 - \alpha)\sigma_2^2\right] + \left(\left[\alpha\mu_1^2 + (1 - \alpha)\mu_2^2\right] - \left[\alpha\mu_1 + (1 - \alpha)\mu_2\right]^2\right). \tag{6.82}$$


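Proof. The mean of the mixture density $p(x)$ is given by the weighted
sum of the means of each random variable. We apply the definition of the
mean (Definition 6.4), and plug in our mixture (6.80), which yields


$$\begin{aligned}
\mathbb{E}[x] &= \int_{-\infty}^{\infty} x\,p(x)\,\mathrm{d}x && \text{(6.83a)} \\
&= \int_{-\infty}^{\infty} \left(\alpha x\,p_1(x) + (1 - \alpha) x\,p_2(x)\right) \mathrm{d}x && \text{(6.83b)} \\
&= \alpha \int_{-\infty}^{\infty} x\,p_1(x)\,\mathrm{d}x + (1 - \alpha) \int_{-\infty}^{\infty} x\,p_2(x)\,\mathrm{d}x && \text{(6.83c)} \\
&= \alpha\mu_1 + (1 - \alpha)\mu_2. && \text{(6.83d)}
\end{aligned}$$


To compute the variance, we can use the raw-score version of the variance
from (6.44), which requires an expression of the expectation of the
squared random variable. Here we use the definition of an expectation of
a function (the square) of a random variable (Definition 6.3),


$$\begin{aligned}
\mathbb{E}[x^2] &= \int_{-\infty}^{\infty} x^2 p(x)\,\mathrm{d}x && \text{(6.84a)} \\
&= \int_{-\infty}^{\infty} \left(\alpha x^2 p_1(x) + (1 - \alpha) x^2 p_2(x)\right) \mathrm{d}x && \text{(6.84b)} \\
&= \alpha \int_{-\infty}^{\infty} x^2 p_1(x)\,\mathrm{d}x + (1 - \alpha) \int_{-\infty}^{\infty} x^2 p_2(x)\,\mathrm{d}x && \text{(6.84c)} \\
&= \alpha(\mu_1^2 + \sigma_1^2) + (1 - \alpha)(\mu_2^2 + \sigma_2^2), && \text{(6.84d)}
\end{aligned}$$


where in the last equality, we again used the raw-score version of the
variance (6.44) giving $\sigma^2 = \mathbb{E}[x^2] - \mu^2$. This is rearranged such that the
expectation of a squared random variable is the sum of the squared mean
and the variance.

Therefore, the variance is given by subtracting the square of (6.83d) from
(6.84d),


$$\begin{aligned}
\mathbb{V}[x] &= \mathbb{E}[x^2] - (\mathbb{E}[x])^2 && \text{(6.85a)} \\
&= \alpha(\mu_1^2 + \sigma_1^2) + (1 - \alpha)(\mu_2^2 + \sigma_2^2) - (\alpha\mu_1 + (1 - \alpha)\mu_2)^2 && \text{(6.85b)} \\
&= \left[\alpha\sigma_1^2 + (1 - \alpha)\sigma_2^2\right] + \left(\left[\alpha\mu_1^2 + (1 - \alpha)\mu_2^2\right] - \left[\alpha\mu_1 + (1 - \alpha)\mu_2\right]^2\right). && \text{(6.85c)}
\end{aligned}$$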














Remark. The preceding derivation holds for any density, but since the
Gaussian is fully determined by the mean and variance, the mixture
density can be determined in closed form. ♢
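The closed-form mean and variance of a two-component Gaussian mixture can be compared against a direct numerical integration of the mixture density. The sketch below uses hypothetical parameters and a plain Riemann sum on a wide grid; it is meant as a sanity check, not as an efficient method.

```python
import math

def normal_pdf(x, mean, var):
    """Univariate Gaussian density with the given mean and variance."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)

# Hypothetical mixture parameters.
alpha, mu1, var1, mu2, var2 = 0.3, -1.0, 0.5, 2.0, 1.5

def mixture_pdf(x):
    return alpha * normal_pdf(x, mu1, var1) + (1.0 - alpha) * normal_pdf(x, mu2, var2)

# Closed-form expressions from the theorem.
mean_formula = alpha * mu1 + (1.0 - alpha) * mu2
var_formula = (alpha * var1 + (1.0 - alpha) * var2) + (
    alpha * mu1**2 + (1.0 - alpha) * mu2**2 - mean_formula**2
)

# Riemann-sum approximation of E[x] and E[x^2] on the interval [-15, 15].
steps = 30000
dx = 30.0 / steps
grid = [-15.0 + i * dx for i in range(steps + 1)]
mean_numeric = sum(x * mixture_pdf(x) for x in grid) * dx
second_moment = sum(x * x * mixture_pdf(x) for x in grid) * dx
var_numeric = second_moment - mean_numeric**2
```

The variance splits into the weighted average of the component variances plus a term that measures how far the component means are spread around the mixture mean.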


For a mixture density, the individual components can be considered
to be conditional distributions (conditioned on the component identity).
Equation (6.85c) is an example of the conditional variance formula, also
known as the law of total variance, which generally states that for two
random variables $X$ and $Y$ it holds that
$\mathbb{V}_X[x] = \mathbb{E}_Y[\mathbb{V}_X[x \mid y]] + \mathbb{V}_Y[\mathbb{E}_X[x \mid y]]$,
i.e., the (total) variance of $X$ is the expected conditional variance plus the
variance of a conditional mean.

We consider in Example 6.17 a bivariate standard Gaussian random
variable $X$ and perform a linear transformation $\boldsymbol{A}\boldsymbol{x}$ on it. The outcome
is a Gaussian random variable with mean zero and covariance $\boldsymbol{A}\boldsymbol{A}^\top$.
Observe that adding a constant vector will change the mean of the
distribution, without affecting its variance, that is, the random variable $\boldsymbol{x} + \boldsymbol{\mu}$ is
Gaussian with mean $\boldsymbol{\mu}$ and identity covariance. Hence, any linear/affine
transformation of a Gaussian random variable is Gaussian distributed.

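Consider a Gaussian distributed random variable $X \sim \mathcal{N}(\boldsymbol{\mu}, \boldsymbol{\Sigma})$. For
a given matrix $\boldsymbol{A}$ of appropriate shape, let $Y$ be a random variable such
that $\boldsymbol{y} = \boldsymbol{A}\boldsymbol{x}$ is a transformed version of $\boldsymbol{x}$. We can compute the mean of
$\boldsymbol{y}$ by exploiting that the expectation is a linear operator (6.50) as follows:


$$\mathbb{E}[\boldsymbol{y}] = \mathbb{E}[\boldsymbol{A}\boldsymbol{x}] = \boldsymbol{A}\mathbb{E}[\boldsymbol{x}] = \boldsymbol{A}\boldsymbol{\mu}. \tag{6.86}$$

Similarly the variance of $\boldsymbol{y}$ can be found by using (6.51):

$$\mathbb{V}[\boldsymbol{y}] = \mathbb{V}[\boldsymbol{A}\boldsymbol{x}] = \boldsymbol{A}\mathbb{V}[\boldsymbol{x}]\boldsymbol{A}^\top = \boldsymbol{A}\boldsymbol{\Sigma}\boldsymbol{A}^\top. \tag{6.87}$$

This means that the random variable $\boldsymbol{y}$ is distributed according to


$$p(\boldsymbol{y}) = \mathcal{N}(\boldsymbol{y} \mid \boldsymbol{A}\boldsymbol{\mu}, \boldsymbol{A}\boldsymbol{\Sigma}\boldsymbol{A}^\top). \tag{6.88}$$

Let us now consider the reverse transformation: when we know that a
random variable has a mean that is a linear transformation of another
random variable. For a given full rank matrix $\boldsymbol{A} \in \mathbb{R}^{M \times N}$, where $M \geq N$,
let $\boldsymbol{y} \in \mathbb{R}^M$ be a Gaussian random variable with mean $\boldsymbol{A}\boldsymbol{x}$, i.e.,

$$p(\boldsymbol{y}) = \mathcal{N}(\boldsymbol{y} \mid \boldsymbol{A}\boldsymbol{x}, \boldsymbol{\Sigma}). \tag{6.89}$$

What is the corresponding probability distribution $p(\boldsymbol{x})$? If $\boldsymbol{A}$ is
invertible, then we can write $\boldsymbol{x} = \boldsymbol{A}^{-1}\boldsymbol{y}$ and apply the transformation in the
previous paragraph. However, in general $\boldsymbol{A}$ is not invertible, and we use
an approach similar to that of the pseudo-inverse (3.57). That is, we
pre-multiply both sides with $\boldsymbol{A}^\top$ and then invert $\boldsymbol{A}^\top\boldsymbol{A}$, which is symmetric
and positive definite, giving us the relation


$$\boldsymbol{y} = \boldsymbol{A}\boldsymbol{x} \iff (\boldsymbol{A}^\top\boldsymbol{A})^{-1}\boldsymbol{A}^\top\boldsymbol{y} = \boldsymbol{x}. \tag{6.90}$$

Hence, $\boldsymbol{x}$ is a linear transformation of $\boldsymbol{y}$, and we obtain

$$p(\boldsymbol{x}) = \mathcal{N}\left(\boldsymbol{x} \mid (\boldsymbol{A}^\top\boldsymbol{A})^{-1}\boldsymbol{A}^\top\boldsymbol{y}, (\boldsymbol{A}^\top\boldsymbol{A})^{-1}\boldsymbol{A}^\top\boldsymbol{\Sigma}\boldsymbol{A}(\boldsymbol{A}^\top\boldsymbol{A})^{-1}\right). \tag{6.91}$$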


6.5.4 Sampling from Multivariate Gaussian Distributions 


We will not explain the subtleties of random sampling on a computer, and
the interested reader is referred to Gentle (2004). In the case of a
multivariate Gaussian, this process consists of three stages: first, we need a
source of pseudo-random numbers that provide a uniform sample in the
interval $[0, 1]$; second, we use a non-linear transformation such as the
Box-Muller transform (Devroye, 1986) to obtain a sample from a
univariate Gaussian; and third, we collate a vector of these samples to obtain a
sample from a multivariate standard normal $\mathcal{N}(\boldsymbol{0}, \boldsymbol{I})$.

For a general multivariate Gaussian, that is, where the mean is nonzero
and the covariance is not the identity matrix, we use the properties
of linear transformations of a Gaussian random variable. Assume we
are interested in generating samples $\boldsymbol{x}_i$, $i = 1, \ldots, n$, from a multivariate
Gaussian distribution with mean $\boldsymbol{\mu}$ and covariance matrix $\boldsymbol{\Sigma}$. We would
like to construct the sample from a sampler that provides samples from
the multivariate standard normal $\mathcal{N}(\boldsymbol{0}, \boldsymbol{I})$.

To obtain samples from a multivariate normal $\mathcal{N}(\boldsymbol{\mu}, \boldsymbol{\Sigma})$, we can use
the properties of a linear transformation of a Gaussian random variable:
If $\boldsymbol{x} \sim \mathcal{N}(\boldsymbol{0}, \boldsymbol{I})$, then $\boldsymbol{y} = \boldsymbol{A}\boldsymbol{x} + \boldsymbol{\mu}$, where $\boldsymbol{A}\boldsymbol{A}^\top = \boldsymbol{\Sigma}$, is Gaussian
distributed with mean $\boldsymbol{\mu}$ and covariance matrix $\boldsymbol{\Sigma}$. One convenient choice of
$\boldsymbol{A}$ is to use the Cholesky decomposition (Section 4.3) of the covariance
matrix $\boldsymbol{\Sigma} = \boldsymbol{A}\boldsymbol{A}^\top$. The Cholesky decomposition has the benefit that $\boldsymbol{A}$ is
triangular, leading to efficient computation.
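The Cholesky-based construction can be sketched with NumPy as follows. In this illustration NumPy's `standard_normal` stands in for the first stages (uniform numbers and the transformation to univariate Gaussians), so only the final affine step is written out; the function name `sample_gaussian` and the parameter values are hypothetical.

```python
import numpy as np

def sample_gaussian(mu, Sigma, n, rng):
    """Draw n samples from N(mu, Sigma) via y = A x + mu with Sigma = A A^T."""
    mu = np.asarray(mu, dtype=float)
    A = np.linalg.cholesky(Sigma)                    # lower-triangular factor
    Z = rng.standard_normal(size=(n, mu.shape[0]))   # rows are samples from N(0, I)
    return Z @ A.T + mu                              # apply the affine map to every row

rng = np.random.default_rng(0)
mu = np.array([0.0, 2.0])
Sigma = np.array([[0.3, -1.0],
                  [-1.0, 5.0]])                      # symmetric, positive definite
samples = sample_gaussian(mu, Sigma, 200_000, rng)

sample_mean = samples.mean(axis=0)
sample_cov = np.cov(samples, rowvar=False)
```

With this many samples, the empirical mean and covariance should be close to `mu` and `Sigma`. The Cholesky factorization requires a positive definite covariance matrix, consistent with the assumption made earlier in this chapter.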
6.6 Conjugacy and the Exponential Family


Many of the probability distributions “with names” that we find in statis- 
tics textbooks were discovered to model particular types of phenomena. 
For example, we have seen the Gaussian distribution in Section 6.5. The 
distributions are also related to each other in complex ways (Leemis and 
McQueston, 2008). For a beginner in the field, it can be overwhelming to 
figure out which distribution to use. In addition, many of these distributions
were discovered at a time when statistics and computation were done
by pencil and paper. It is natural to ask what are meaningful concepts
in the computing age (Efron and Hastie, 2016). In the previous section, 
we saw that many of the operations required for inference can be conve- 
niently calculated when the distribution is Gaussian. It is worth recalling 
at this point the desiderata for manipulating probability distributions in 
the machine learning context: 


1. There is some “closure property” when applying the rules of probability, 
e.g., Bayes’ theorem. By closure, we mean that applying a particular 
operation returns an object of the same type. 

2. As we collect more data, we do not need more parameters to describe 
the distribution. 

3. Since we are interested in learning from data, we want parameter es- 
timation to behave nicely. 


It turns out that the class of distributions called the exponential family 
provides the right balance of generality while retaining favorable compu- 
tation and inference properties. Before we introduce the exponential fam- 
ily, let us see three more members of “named” probability distributions, 
the Bernoulli (Example 6.8), Binomial (Example 6.9), and Beta (Exam- 
ple 6.10) distributions. 


Example 6.8 

The Bernoulli distribution is a distribution for a single binary random
variable $X$ with state $x \in \{0, 1\}$. It is governed by a single continuous
parameter $\mu \in [0, 1]$ that represents the probability of $X = 1$. The Bernoulli
distribution $\mathrm{Ber}(\mu)$ is defined as


$$\begin{aligned}
p(x \mid \mu) &= \mu^x (1 - \mu)^{1 - x}, \quad x \in \{0, 1\}, && \text{(6.92)} \\
\mathbb{E}[x] &= \mu, && \text{(6.93)} \\
\mathbb{V}[x] &= \mu(1 - \mu), && \text{(6.94)}
\end{aligned}$$


where $\mathbb{E}[x]$ and $\mathbb{V}[x]$ are the mean and variance of the binary random
variable $X$.

An example where the Bernoulli distribution can be used is when we
are interested in modeling the probability of “heads” when flipping a coin.

[Figure axis label: Number m of observations x = 1 in N = 15 experiments]


Remark. The rewriting above of the Bernoulli distribution, where we use
Boolean variables as numerical 0 or 1 and express them in the exponents,
is a trick that is often used in machine learning textbooks. Another
occurrence of this is when expressing the Multinomial distribution. ♢


Example 6.9 (Binomial Distribution) 

The Binomial distribution is a generalization of the Bernoulli distribution
to a distribution over integers (illustrated in Figure 6.10). In particular,
the Binomial can be used to describe the probability of observing $m$
occurrences of $X = 1$ in a set of $N$ samples from a Bernoulli distribution
where $p(X = 1) = \mu \in [0, 1]$. The Binomial distribution $\mathrm{Bin}(N, \mu)$ is
defined as


$$\begin{aligned}
p(m \mid N, \mu) &= \binom{N}{m} \mu^m (1 - \mu)^{N - m}, && \text{(6.95)} \\
\mathbb{E}[m] &= N\mu, && \text{(6.96)} \\
\mathbb{V}[m] &= N\mu(1 - \mu), && \text{(6.97)}
\end{aligned}$$


where $\mathbb{E}[m]$ and $\mathbb{V}[m]$ are the mean and variance of $m$, respectively.


An example where the Binomial could be used is if we want to describe
the probability of observing $m$ “heads” in $N$ coin-flip experiments if the
probability for observing head in a single experiment is $\mu$.
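The Binomial probabilities, together with the stated mean and variance, can be checked by direct enumeration. The sketch below uses $N = 15$ experiments and a hypothetical success probability $\mu = 0.4$; the function name is illustrative.

```python
from math import comb

def binomial_pmf(m, N, mu):
    """Probability of m successes in N independent Bernoulli(mu) trials."""
    return comb(N, m) * mu**m * (1.0 - mu) ** (N - m)

N, mu = 15, 0.4                                   # hypothetical parameters
pmf = [binomial_pmf(m, N, mu) for m in range(N + 1)]

total = sum(pmf)                                  # probabilities should sum to 1
mean = sum(m * p for m, p in enumerate(pmf))      # compare with N * mu
var = sum((m - mean) ** 2 * p for m, p in enumerate(pmf))   # compare with N * mu * (1 - mu)

# With N = 1 the Binomial reduces to the Bernoulli distribution.
bernoulli = [binomial_pmf(m, 1, mu) for m in (0, 1)]
```

Setting `N = 1` recovers the two Bernoulli probabilities $1 - \mu$ and $\mu$, which shows in what sense the Binomial generalizes the Bernoulli distribution.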


Example 6.10 (Beta Distribution) 

We may wish to model a continuous random variable on a finite interval.
The Beta distribution is a distribution over a continuous random variable
$\mu \in [0, 1]$, which is often used to represent the probability for some binary
event (e.g., the parameter governing the Bernoulli distribution). The Beta
distribution $\mathrm{Beta}(\alpha, \beta)$ (illustrated in Figure 6.11) itself is governed by
two parameters $\alpha > 0$, $\beta > 0$ and is defined as


$$p(\mu \mid \alpha, \beta) = \frac{\Gamma(\alpha + \beta)}{\Gamma(\alpha)\Gamma(\beta)} \mu^{\alpha - 1} (1 - \mu)^{\beta - 1}, \tag{6.98}$$

$$\mathbb{E}[\mu] = \frac{\alpha}{\alpha + \beta}, \qquad \mathbb{V}[\mu] = \frac{\alpha\beta}{(\alpha + \beta)^2 (\alpha + \beta + 1)}, \tag{6.99}$$

where $\Gamma(\cdot)$ is the Gamma function defined as

$$\Gamma(t) := \int_0^\infty x^{t - 1} \exp(-x)\,\mathrm{d}x, \qquad t > 0, \tag{6.100}$$

$$\Gamma(t + 1) = t\,\Gamma(t). \tag{6.101}$$


Note that the fraction of Gamma functions in (6.98) normalizes the Beta
distribution.

Draft (2021-01-14) of “Mathematics for Machine Learning”. Feedback: https://mml-book.com.

[Figure 6.11 (plot not reproduced): legend entries α = 0.5 = β; α = 1 = β; α = 2, β = 0.3; α = 4, β = 10; α = 5, β = 1.]

Intuitively, $\alpha$ moves probability mass toward 1, whereas $\beta$ moves
probability mass toward 0. There are some special cases (Murphy, 2012):


- For $\alpha = 1 = \beta$, we obtain the uniform distribution $\mathcal{U}[0, 1]$.
- For $\alpha, \beta < 1$, we get a bimodal distribution with spikes at 0 and 1.
- For $\alpha, \beta > 1$, the distribution is unimodal.
- For $\alpha, \beta > 1$ and $\alpha = \beta$, the distribution is unimodal, symmetric, and
  centered in the interval $[0, 1]$, i.e., the mode/mean is at $\frac{1}{2}$.


Remark. There is a whole zoo of distributions with names, and they are 
related in different ways to each other (Leemis and McQueston, 2008). 
It is worth keeping in mind that each named distribution is created for a 
particular reason, but may have other applications. Knowing the reason 
behind the creation of a particular distribution often allows insight into 
how to best use it. We introduced the preceding three distributions to be able 
to illustrate the concepts of conjugacy (Section 6.6.1) and exponential 
families (Section 6.6.3). 


6.6.1 Conjugacy 


According to Bayes’ theorem (6.23), the posterior is proportional to the 
product of the prior and the likelihood. The specification of the prior can 
be tricky for two reasons: First, the prior should encapsulate our knowl- 
edge about the problem before we see any data. This is often difficult to 
describe. Second, it is often not possible to compute the posterior distribu- 
tion analytically. However, there are some priors that are computationally 
convenient: conjugate priors. 


Definition 6.13 (Conjugate Prior). A prior is conjugate for the likelihood 
function if the posterior is of the same form/type as the prior. 


Conjugacy is particularly convenient because we can algebraically cal- 
culate our posterior distribution by updating the parameters of the prior 
distribution. 


Remark. When considering the geometry of probability distributions, conjugate 
priors retain the same distance structure as the likelihood (Agarwal 
and Daumé III, 2010). 


To introduce a concrete example of conjugate priors, we describe in Ex- 
ample 6.11 the Binomial distribution (defined on discrete random vari- 
ables) and the Beta distribution (defined on continuous random vari- 
ables). 


Table 6.2: Examples of conjugate priors for the parameters of some standard likelihoods. 

Likelihood     Conjugate prior             Posterior 
Bernoulli      Beta                        Beta 
Binomial       Beta                        Beta 
Gaussian       Gaussian/inverse Gamma      Gaussian/inverse Gamma 
Gaussian       Gaussian/inverse Wishart    Gaussian/inverse Wishart 
Multinomial    Dirichlet                   Dirichlet 


Example 6.11 (Beta-Binomial Conjugacy) 
Consider a Binomial random variable $x \sim \mathrm{Bin}(N, \mu)$ where 

$$p(x \mid N, \mu) = \binom{N}{x} \mu^x (1 - \mu)^{N - x}, \quad x = 0, 1, \ldots, N, \tag{6.102}$$

is the probability of finding x times the outcome “heads” in N coin flips, 
where $\mu$ is the probability of a “head”. We place a Beta prior on the 
parameter $\mu$, that is, $\mu \sim \mathrm{Beta}(\alpha, \beta)$, where 

$$p(\mu \mid \alpha, \beta) = \frac{\Gamma(\alpha + \beta)}{\Gamma(\alpha)\Gamma(\beta)} \mu^{\alpha - 1} (1 - \mu)^{\beta - 1}. \tag{6.103}$$

If we now observe some outcome x = h, that is, we see h heads in N coin 
flips, we compute the posterior distribution on $\mu$ as 

$$p(\mu \mid x = h, N, \alpha, \beta) \propto p(x \mid N, \mu)\, p(\mu \mid \alpha, \beta) \tag{6.104a}$$
$$\propto \mu^{h} (1 - \mu)^{N - h} \mu^{\alpha - 1} (1 - \mu)^{\beta - 1} \tag{6.104b}$$
$$= \mu^{h + \alpha - 1} (1 - \mu)^{(N - h) + \beta - 1} \tag{6.104c}$$
$$\propto \mathrm{Beta}(h + \alpha, N - h + \beta), \tag{6.104d}$$

i.e., the posterior distribution is a Beta distribution like the prior; that is, the 
Beta prior is conjugate for the parameter $\mu$ in the Binomial likelihood 
function. 
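
The Beta-Binomial update in (6.104d) amounts to adding counts to the prior parameters. The sketch below is an illustration under our own assumptions (the function names and the numbers $\alpha = 2$, $\beta = 3$, $h = 7$, $N = 10$ are hypothetical): it computes the posterior parameters and checks that likelihood times prior is a constant multiple of the claimed Beta posterior.

```python
import math


def beta_pdf(mu, alpha, beta):
    """Beta density with parameters alpha, beta > 0."""
    norm = math.gamma(alpha + beta) / (math.gamma(alpha) * math.gamma(beta))
    return norm * mu ** (alpha - 1) * (1 - mu) ** (beta - 1)


def binomial_pmf(h, n, mu):
    """Probability of h heads in n coin flips with head probability mu."""
    return math.comb(n, h) * mu ** h * (1 - mu) ** (n - h)


def posterior_params(alpha, beta, h, n):
    """Parameter update from (6.104d): Beta(h + alpha, n - h + beta)."""
    return alpha + h, beta + (n - h)


alpha, beta, h, n = 2.0, 3.0, 7, 10  # illustrative values only
post_alpha, post_beta = posterior_params(alpha, beta, h, n)

# If the posterior is Beta(post_alpha, post_beta), then the ratio of
# (likelihood * prior) to that Beta density must not depend on mu.
ratios = [
    binomial_pmf(h, n, mu) * beta_pdf(mu, alpha, beta)
    / beta_pdf(mu, post_alpha, post_beta)
    for mu in (0.2, 0.5, 0.8)
]
print(post_alpha, post_beta, ratios)
```

The constant ratio is the normalizer that the proportionality signs in (6.104a) to (6.104d) leave out.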


In the following example, we will derive a result that is similar to the 
Beta-Binomial conjugacy result. Here we will show that the Beta distribu- 
tion is a conjugate prior for the Bernoulli distribution. 


Example 6.12 (Beta-Bernoulli Conjugacy) 

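Let $x \in \{0, 1\}$ be distributed according to the Bernoulli distribution with 
parameter $\theta \in [0, 1]$, that is, $p(x = 1 \mid \theta) = \theta$. This can also be expressed 
as $p(x \mid \theta) = \theta^{x} (1 - \theta)^{1 - x}$. Let $\theta$ be distributed according to a Beta 
distribution with parameters $\alpha, \beta$, that is, $p(\theta \mid \alpha, \beta) \propto \theta^{\alpha - 1} (1 - \theta)^{\beta - 1}$. 
Multiplying the Beta and the Bernoulli distributions, we get 

$$p(\theta \mid x, \alpha, \beta) \propto p(x \mid \theta)\, p(\theta \mid \alpha, \beta) \tag{6.105a}$$
$$\propto \theta^{x} (1 - \theta)^{1 - x} \theta^{\alpha - 1} (1 - \theta)^{\beta - 1} \tag{6.105b}$$
$$= \theta^{\alpha + x - 1} (1 - \theta)^{\beta + (1 - x) - 1} \tag{6.105c}$$
$$\propto p(\theta \mid \alpha + x, \beta + (1 - x)). \tag{6.105d}$$

The last line is the Beta distribution with parameters $(\alpha + x, \beta + (1 - x))$. 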
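
Because the posterior is again a Beta distribution, the update in (6.105d) can be applied one observation at a time. The following short sketch assumes a uniform Beta(1, 1) prior and a made-up sequence of coin flips; `bernoulli_update` is a hypothetical helper name.

```python
def bernoulli_update(alpha, beta, x):
    """One Beta-Bernoulli update: (alpha, beta) -> (alpha + x, beta + (1 - x))."""
    return alpha + x, beta + (1 - x)


flips = [1, 0, 1, 1, 0, 1]  # illustrative data: 4 heads, 2 tails
alpha, beta = 1.0, 1.0      # assumed prior Beta(1, 1), i.e., uniform

for x in flips:
    alpha, beta = bernoulli_update(alpha, beta, x)

# Mean of the final Beta posterior, alpha / (alpha + beta).
posterior_mean = alpha / (alpha + beta)
print(alpha, beta, posterior_mean)
```

The final parameters only depend on the number of heads and tails, which matches a single Beta-Binomial update with the same counts.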


Table 6.2 lists examples for conjugate priors for the parameters of some 
standard likelihoods used in probabilistic modeling. Distributions such as 
Multinomial, inverse Gamma, inverse Wishart, and Dirichlet can be found 
in any statistical text, and are described in Bishop (2006), for example. 

The Beta distribution is the conjugate prior for the parameter $\mu$ in both 
the Binomial and the Bernoulli likelihood. For a Gaussian likelihood function, 
we can place a conjugate Gaussian prior on the mean. The reason 
why the Gaussian likelihood appears twice in the table is that we need 
to distinguish the univariate from the multivariate case. In the univariate 
(scalar) case, the inverse Gamma is the conjugate prior for the variance. 
In the multivariate case, we use a conjugate inverse Wishart distribution 
as a prior on the covariance matrix. The Dirichlet distribution is the conjugate 
prior for the multinomial likelihood function. For further details, we 
refer to Bishop (2006). 


6.6.2 Sufficient Statistics 


Recall that a statistic of a random variable is a deterministic function of 
that random variable. For example, if $x = [x_1, \ldots, x_N]^\top$ is a vector of 
univariate Gaussian random variables, that is, $x_n \sim \mathcal{N}(\mu, \sigma^2)$, then the 
sample mean $\hat{\mu} = \frac{1}{N}(x_1 + \cdots + x_N)$ is a statistic. Sir Ronald Fisher discovered 
the notion of sufficient statistics: the idea that there are statistics 
that will contain all available information that can be inferred from data 
corresponding to the distribution under consideration. In other words, suf- 
ficient statistics carry all the information needed to make inference about 
the population, that is, they are the statistics that are sufficient to repre- 
sent the distribution. 

For a set of distributions parametrized by $\theta$, let X be a random variable 
with distribution $p(x \mid \theta_0)$ given an unknown $\theta_0$. A vector $\phi(x)$ of statistics 
is called sufficient statistics for $\theta_0$ if they contain all possible information 
about $\theta_0$. To be more formal about “contain all possible information”, 
this means that the probability of x given $\theta$ can be factored into a part 
that does not depend on $\theta$, and a part that depends on $\theta$ only via $\phi(x)$. 
The Fisher-Neyman factorization theorem formalizes this notion, which 
we state in Theorem 6.14 without proof. 


Theorem 6.14 (Fisher-Neyman). [Theorem 6.5 in Lehmann and Casella 
(1998)] Let X have probability density function $p(x \mid \theta)$. Then the statistics 
$\phi(x)$ are sufficient for $\theta$ if and only if $p(x \mid \theta)$ can be written in the form 

$$p(x \mid \theta) = h(x)\, g_\theta(\phi(x)), \tag{6.106}$$

where h(x) is a distribution independent of $\theta$ and $g_\theta$ captures all the dependence 
on $\theta$ via sufficient statistics $\phi(x)$. 


If $p(x \mid \theta)$ does not depend on $\theta$, then $\phi(x)$ is trivially a sufficient statistic 
for any function $\phi$. The more interesting case is that $p(x \mid \theta)$ is dependent 
only on $\phi(x)$ and not x itself. In this case, $\phi(x)$ is a sufficient statistic for 
$\theta$. 
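
As a concrete illustration of the factorization, consider N independent Bernoulli observations with parameter $\theta$. The sketch below is our own example, with hypothetical function names and made-up data: the joint likelihood depends on the data only through the number of ones, so two different datasets with the same count have identical likelihoods for every $\theta$. Here h(x) = 1 and the count (together with the known N) plays the role of $\phi(x)$.

```python
import math


def bernoulli_likelihood(data, theta):
    """Joint likelihood of i.i.d. Bernoulli observations."""
    return math.prod(theta ** x * (1 - theta) ** (1 - x) for x in data)


def likelihood_from_statistic(n, s, theta):
    """The same likelihood written via the statistic s = sum of the data."""
    return theta ** s * (1 - theta) ** (n - s)


# Two different datasets with the same number of ones (illustrative data).
data_a = [1, 1, 0, 0, 1]
data_b = [0, 1, 1, 1, 0]

for theta in (0.2, 0.7):
    lik_a = bernoulli_likelihood(data_a, theta)
    lik_b = bernoulli_likelihood(data_b, theta)
    lik_s = likelihood_from_statistic(len(data_a), sum(data_a), theta)
    print(theta, lik_a, lik_b, lik_s)
```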

In machine learning, we consider a finite number of samples from a 
distribution. One could imagine that for simple distributions (such as the 
Bernoulli in Example 6.8) we only need a small number of samples to 
estimate the parameters of the distributions. We could also consider the 
opposite problem: If we have a set of data (a sample from an unknown 
distribution), which distribution gives the best fit? A natural question to 
ask is, as we observe more data, do we need more parameters $\theta$ to describe 
the distribution? It turns out that the answer is yes in general, and 
this is studied in non-parametric statistics (Wasserman, 2007). A converse 
question is to consider which class of distributions have finite-dimensional 
sufficient statistics, that is, the number of parameters needed to describe 
them does not increase arbitrarily. The answer is exponential family 
distributions, described in the following section. 


6.6.3 Exponential Family 


There are three possible levels of abstraction we can have when con- 
sidering distributions (of discrete or continuous random variables). At 
level one (the most concrete end of the spectrum), we have a particular 
named distribution with fixed parameters, for example a univariate 
Gaussian $\mathcal{N}(0, 1)$ with zero mean and unit variance. In machine learning, 
we often use the second level of abstraction, that is, we fix the parametric 
form (the univariate Gaussian) and infer the parameters from data. For 
example, we assume a univariate Gaussian $\mathcal{N}(\mu, \sigma^2)$ with unknown mean 
$\mu$ and unknown variance $\sigma^2$, and use a maximum likelihood fit to determine 
the best parameters $(\mu, \sigma^2)$. We will see an example of this when 
considering linear regression in Chapter 9. A third level of abstraction is 
to consider families of distributions, and in this book, we consider the ex- 
ponential family. The univariate Gaussian is an example of a member of 
the exponential family. Many of the widely used statistical models, includ- 
ing all the “named” models in Table 6.2, are members of the exponential 
family. They can all be unified into one concept (Brown, 1986). 


Remark. A brief historical anecdote: Like many concepts in mathemat- 
ics and science, exponential families were independently discovered at 
the same time by different researchers. In the years 1935-1936, Edwin 
Pitman in Tasmania, Georges Darmois in Paris, and Bernard Koopman in 
New York independently showed that the exponential families are the only 
families that enjoy finite-dimensional sufficient statistics under repeated 
independent sampling (Lehmann and Casella, 1998). 


An exponential family is a family of probability distributions, parameterized 
by $\theta \in \mathbb{R}^D$, of the form 

$$p(x \mid \theta) = h(x) \exp\left( \langle \theta, \phi(x) \rangle - A(\theta) \right), \tag{6.107}$$


where $\phi(x)$ is the vector of sufficient statistics. In general, any inner product 
(Section 3.2) can be used in (6.107), and for concreteness we will use 
the standard dot product here ($\langle \theta, \phi(x) \rangle = \theta^\top \phi(x)$). Note that the form 
of the exponential family is essentially a particular expression of $g_\theta(\phi(x))$ 
in the Fisher-Neyman theorem (Theorem 6.14). 

The factor h(x) can be absorbed into the dot product term by adding 
another entry (log h(x)) to the vector of sufficient statistics $\phi(x)$, and 
constraining the corresponding parameter $\theta_0 = 1$. The term $A(\theta)$ is the 
normalization constant that ensures that the distribution sums up or integrates 
to one and is called the log-partition function. A good intuitive notion 
of exponential families can be obtained by ignoring these two terms 
and considering exponential families as distributions of the form 

$$p(x \mid \theta) \propto \exp\left( \theta^\top \phi(x) \right). \tag{6.108}$$


For this form of parametrization, the parameters $\theta$ are called the natural 
parameters. At first glance, it seems that exponential families are a mundane 
transformation by adding the exponential function to the result of a 
dot product. However, there are many implications that allow for convenient 
modeling and efficient computation based on the fact that we can 
capture information about data in $\phi(x)$. 


Example 6.13 (Gaussian as Exponential Family) 


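Consider the univariate Gaussian distribution $\mathcal{N}(\mu, \sigma^2)$. Let $\phi(x) = \begin{bmatrix} x \\ x^2 \end{bmatrix}$. 
Then by using the definition of the exponential family, 

$$p(x \mid \theta) \propto \exp(\theta_1 x + \theta_2 x^2). \tag{6.109}$$

Setting 

$$\theta = \begin{bmatrix} \dfrac{\mu}{\sigma^2}, & -\dfrac{1}{2\sigma^2} \end{bmatrix}^\top \tag{6.110}$$

and substituting into (6.109), we obtain 

$$p(x \mid \theta) \propto \exp\left( \frac{\mu x}{\sigma^2} - \frac{x^2}{2\sigma^2} \right) \propto \exp\left( -\frac{1}{2\sigma^2} (x - \mu)^2 \right). \tag{6.111}$$

Therefore, the univariate Gaussian distribution is a member of the exponential 
family with sufficient statistic $\phi(x) = \begin{bmatrix} x \\ x^2 \end{bmatrix}$, and natural parameters 
given by $\theta$ in (6.110). 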
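
The proportionality in (6.111) can be verified numerically. In the following sketch (an illustration with hypothetical helper names and arbitrarily chosen values $\mu = 1.5$, $\sigma^2 = 0.8$), the unnormalized exponential-family expression divided by the usual Gaussian density gives the same constant for every x.

```python
import math


def natural_params(mu, sigma2):
    """Natural parameters of N(mu, sigma2) as in (6.110)."""
    return mu / sigma2, -1.0 / (2.0 * sigma2)


def unnormalized(x, theta):
    """exp(theta_1 * x + theta_2 * x^2), the right-hand side of (6.109)."""
    theta1, theta2 = theta
    return math.exp(theta1 * x + theta2 * x * x)


def gaussian_pdf(x, mu, sigma2):
    """Standard form of the univariate Gaussian density."""
    return math.exp(-(x - mu) ** 2 / (2.0 * sigma2)) / math.sqrt(2.0 * math.pi * sigma2)


mu, sigma2 = 1.5, 0.8  # illustrative values only
theta = natural_params(mu, sigma2)

# The ratio is constant in x; the constant is what exp(-A(theta)) absorbs.
ratios = [unnormalized(x, theta) / gaussian_pdf(x, mu, sigma2)
          for x in (-1.0, 0.0, 2.0, 3.5)]
print(theta, ratios)
```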


Example 6.14 (Bernoulli as Exponential Family) 
Recall the Bernoulli distribution from Example 6.8 

$$p(x \mid \mu) = \mu^{x} (1 - \mu)^{1 - x}, \quad x \in \{0, 1\}. \tag{6.112}$$

This can be written in exponential family form 

$$p(x \mid \mu) = \exp\left[ \log\left( \mu^{x} (1 - \mu)^{1 - x} \right) \right] \tag{6.113a}$$
$$= \exp\left[ x \log \mu + (1 - x) \log(1 - \mu) \right] \tag{6.113b}$$
$$= \exp\left[ x \log \mu - x \log(1 - \mu) + \log(1 - \mu) \right] \tag{6.113c}$$
$$= \exp\left[ x \log \frac{\mu}{1 - \mu} + \log(1 - \mu) \right]. \tag{6.113d}$$

The last line (6.113d) can be identified as being in exponential family 
form (6.107) by observing that 

$$h(x) = 1 \tag{6.114}$$
$$\theta = \log \frac{\mu}{1 - \mu} \tag{6.115}$$
$$\phi(x) = x \tag{6.116}$$
$$A(\theta) = -\log(1 - \mu) = \log(1 + \exp(\theta)). \tag{6.117}$$

The relationship between $\theta$ and $\mu$ is invertible so that 

$$\mu = \frac{1}{1 + \exp(-\theta)}. \tag{6.118}$$

The relation (6.118) is used to obtain the right equality of (6.117). 


Remark. The relationship between the original Bernoulli parameter $\mu$ and 
the natural parameter $\theta$ is known as the sigmoid or logistic function. Observe 
that $\mu \in (0, 1)$ but $\theta \in \mathbb{R}$, and therefore the sigmoid function 
squeezes a real value into the range (0, 1). This property is useful in machine 
learning, for example it is used in logistic regression (Bishop, 2006, 
section 4.3.2), as well as a nonlinear activation function in neural networks 
(Goodfellow et al., 2016, chapter 6). 
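
The mapping between $\mu$ and $\theta$ and the log-partition function from (6.115) to (6.118) are easy to check in code. The sketch below uses function names of our own choosing (`logit`, `sigmoid`, `log_partition`, `bernoulli_expfam`) and an arbitrary value $\mu = 0.3$.

```python
import math


def logit(mu):
    """Natural parameter theta = log(mu / (1 - mu)), see (6.115)."""
    return math.log(mu / (1.0 - mu))


def sigmoid(theta):
    """Inverse mapping mu = 1 / (1 + exp(-theta)), see (6.118)."""
    return 1.0 / (1.0 + math.exp(-theta))


def log_partition(theta):
    """A(theta) = log(1 + exp(theta)), see (6.117)."""
    return math.log1p(math.exp(theta))


def bernoulli_expfam(x, theta):
    """Bernoulli pmf in exponential family form with h(x) = 1, phi(x) = x."""
    return math.exp(theta * x - log_partition(theta))


mu = 0.3  # illustrative value
theta = logit(mu)
print(theta, sigmoid(theta), bernoulli_expfam(1, theta), bernoulli_expfam(0, theta))
```

Evaluating the exponential family form at x = 1 and x = 0 recovers $\mu$ and $1 - \mu$, respectively.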


It is often not obvious how to find the parametric form of the conjugate 
distribution of a particular distribution (for example, those in Table 6.2). 
Exponential families provide a convenient way to find conjugate pairs of 
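distributions. Consider a random variable X whose distribution is a member of 
the exponential family (6.107): 

$$p(x \mid \theta) = h(x) \exp\left( \langle \theta, \phi(x) \rangle - A(\theta) \right). \tag{6.119}$$

Every member of the exponential family has a conjugate prior (Brown, 
1986) 

$$p(\theta \mid \gamma) = h_c(\theta) \exp\left( \left\langle \begin{bmatrix} \gamma_1 \\ \gamma_2 \end{bmatrix}, \begin{bmatrix} \theta \\ -A(\theta) \end{bmatrix} \right\rangle - A_c(\gamma) \right), \tag{6.120}$$

where $\gamma = \begin{bmatrix} \gamma_1 \\ \gamma_2 \end{bmatrix}$ has dimension $\dim(\theta) + 1$. The sufficient statistics of 
the conjugate prior are $\begin{bmatrix} \theta \\ -A(\theta) \end{bmatrix}$. By using the knowledge of the general 
form of conjugate priors for exponential families, we can derive functional 
forms of conjugate priors corresponding to particular distributions. 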





Example 6.15 
Recall the exponential family form of the Bernoulli distribution (6.113d) 

$$p(x \mid \mu) = \exp\left[ x \log \frac{\mu}{1 - \mu} + \log(1 - \mu) \right]. \tag{6.121}$$

The canonical conjugate prior has the form 

$$p(\mu \mid \alpha, \beta) = \frac{1}{\mu (1 - \mu)} \exp\left[ \alpha \log \frac{\mu}{1 - \mu} + (\beta + \alpha) \log(1 - \mu) - A_c(\gamma) \right], \tag{6.122}$$

where we defined $\gamma := [\alpha, \beta + \alpha]^\top$ and $h_c(\mu) := 1/(\mu(1 - \mu))$. Equation 
(6.122) then simplifies to 

$$p(\mu \mid \alpha, \beta) = \exp\left[ (\alpha - 1) \log \mu + (\beta - 1) \log(1 - \mu) - A_c(\alpha, \beta) \right]. \tag{6.123}$$

Putting this in non-exponential family form yields 

$$p(\mu \mid \alpha, \beta) \propto \mu^{\alpha - 1} (1 - \mu)^{\beta - 1}, \tag{6.124}$$

which we identify as the Beta distribution (6.98). In Example 6.12, we 
assumed that the Beta distribution is the conjugate prior of the Bernoulli 
distribution and showed that it was indeed the conjugate prior. In this 
example, we derived the form of the Beta distribution by looking at the 
canonical conjugate prior of the Bernoulli distribution in exponential fam- 
ily form. 


As mentioned in the previous section, the main motivation for expo- 
nential families is that they have finite-dimensional sufficient statistics. 
Additionally, conjugate distributions are easy to write down, and the con- 
jugate distributions also come from an exponential family. From an infer- 
ence perspective, maximum likelihood estimation behaves nicely because 
empirical estimates of sufficient statistics are optimal estimates of the pop- 
ulation values of sufficient statistics (recall the mean and covariance of a 
Gaussian). From an optimization perspective, the log-likelihood function 
is concave, allowing for efficient optimization approaches to be applied 
(Chapter 7). 


6.7 Change of Variables/Inverse Transform 


It may seem that there are very many known distributions, but in reality 
the set of distributions for which we have names is quite limited. There- 
fore, it is often useful to understand how transformed random variables 
are distributed. For example, assuming that X is a random variable distributed 
according to the univariate normal distribution $\mathcal{N}(0, 1)$, what is 
the distribution of $X^2$? Another example, which is quite common in machine 
learning, is, given that $X_1$ and $X_2$ are univariate standard normal, 
what is the distribution of $\frac{1}{2}(X_1 + X_2)$? 

One option to work out the distribution of $\frac{1}{2}(X_1 + X_2)$ is to calculate 
the mean and variance of $X_1$ and $X_2$ and then combine them. As we saw 
in Section 6.4.4, we can calculate the mean and variance of resulting random 
variables when we consider affine transformations of random variables. 
However, we may not be able to obtain the functional form of the 
distribution under transformations. Furthermore, we may be interested 
in nonlinear transformations of random variables for which closed-form 
expressions are not readily available. 


Remark (Notation). In this section, we will be explicit about random variables 
and the values they take. Hence, recall that we use capital letters 
X, Y to denote random variables and small letters x, y to denote the values 
in the target space $\mathcal{T}$ that the random variables take. We will explicitly 
write pmfs of discrete random variables X as P(X = x). For continuous 
random variables X (Section 6.2.2), the pdf is written as f(x) and the cdf 
is written as $F_X(x)$. 


We will look at two approaches for obtaining distributions of transfor- 
mations of random variables: a direct approach using the definition of a 
cumulative distribution function and a change-of-variable approach that 
uses the chain rule of calculus (Section 5.2.2). The change-of-variable ap- 
proach is widely used because it provides a “recipe” for attempting to 
compute the resulting distribution due to a transformation. We will ex- 
plain the techniques for univariate random variables, and will only briefly 
provide the results for the general case of multivariate random variables. 

Transformations of discrete random variables can be understood directly. 
Suppose that there is a discrete random variable X with pmf P(X = x) 
(Section 6.2.1), and an invertible function U(x). Consider the transformed 
random variable Y := U(X), with pmf P(Y = y). Then 

$$P(Y = y) = P(U(X) = y) \quad \text{transformation of interest} \tag{6.125a}$$
$$= P(X = U^{-1}(y)) \quad \text{inverse} \tag{6.125b}$$

where we can observe that $x = U^{-1}(y)$. Therefore, for discrete random 
variables, transformations directly change the individual events (with the 
probabilities appropriately transformed). 
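
For a discrete random variable, (6.125b) says that an invertible transformation only relabels the states while the probabilities carry over unchanged. A minimal sketch, assuming a made-up pmf on {1, 2, 3} and the invertible map U(x) = 2x + 1 (both are our own illustrative choices):

```python
from fractions import Fraction

# Illustrative pmf of X on the states {1, 2, 3}.
pmf_x = {1: Fraction(1, 2), 2: Fraction(1, 3), 3: Fraction(1, 6)}


def U(x):
    """An invertible transformation on the states of X (assumed for this example)."""
    return 2 * x + 1


# P(Y = y) = P(X = U^{-1}(y)): each state x is relabeled as y = U(x).
pmf_y = {U(x): p for x, p in pmf_x.items()}
print(pmf_y)
```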


6.7.1 Distribution Function Technique 


The distribution function technique goes back to first principles, and uses 
the definition of a cdf $F_X(x) = P(X \leq x)$ and the fact that its differential 
is the pdf f(x) (Wasserman, 2004, chapter 2). For a random variable X 
and a function U, we find the pdf of the random variable Y := U(X) by 

1. Finding the cdf: 

$$F_Y(y) = P(Y \leq y) \tag{6.126}$$

2. Differentiating the cdf $F_Y(y)$ to get the pdf f(y): 

$$f(y) = \frac{d}{dy} F_Y(y). \tag{6.127}$$

We also need to keep in mind that the domain of the random variable may 
have changed due to the transformation by U. 


Example 6.16 
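Let X be a continuous random variable with probability density function 
on $0 \leq x \leq 1$ 

$$f(x) = 3x^2. \tag{6.128}$$

We are interested in finding the pdf of $Y = X^2$. 

The function f is an increasing function of x, and therefore the resulting 
value of y lies in the interval [0, 1]. We obtain 

$$F_Y(y) = P(Y \leq y) \quad \text{definition of cdf} \tag{6.129a}$$
$$= P(X^2 \leq y) \quad \text{transformation of interest} \tag{6.129b}$$
$$= P(X \leq y^{\frac{1}{2}}) \quad \text{inverse} \tag{6.129c}$$
$$= F_X(y^{\frac{1}{2}}) \quad \text{definition of cdf} \tag{6.129d}$$
$$= \int_0^{y^{\frac{1}{2}}} 3t^2\, dt \quad \text{cdf as a definite integral} \tag{6.129e}$$
$$= \left[ t^3 \right]_{t=0}^{t=y^{\frac{1}{2}}} \quad \text{result of integration} \tag{6.129f}$$
$$= y^{\frac{3}{2}}, \quad 0 \leq y \leq 1. \tag{6.129g}$$

Therefore, the cdf of Y is 

$$F_Y(y) = y^{\frac{3}{2}} \tag{6.130}$$

for $0 \leq y \leq 1$. To obtain the pdf, we differentiate the cdf 

$$f(y) = \frac{d}{dy} F_Y(y) = \frac{3}{2} y^{\frac{1}{2}} \tag{6.131}$$

for $0 \leq y \leq 1$. 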
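
The two steps of the distribution function technique in Example 6.16 can be reproduced numerically. The sketch below is an illustration under our own assumptions: the function names are hypothetical, the integral is approximated with a midpoint rule, and the derivative with a central difference, so the results agree with (6.130) and (6.131) only up to small numerical error.

```python
import math


def pdf_x(x):
    """Density of X from (6.128): 3 x^2 on [0, 1]."""
    return 3.0 * x * x


def cdf_y(y, n=10_000):
    """Step 1: F_Y(y) = P(X <= sqrt(y)), via a midpoint-rule integral of pdf_x."""
    upper = math.sqrt(y)
    h = upper / n
    return h * sum(pdf_x((i + 0.5) * h) for i in range(n))


def pdf_y(y, eps=1e-5):
    """Step 2: differentiate the cdf of Y (central difference)."""
    return (cdf_y(y + eps) - cdf_y(y - eps)) / (2.0 * eps)


y = 0.49  # illustrative evaluation point, sqrt(y) = 0.7
cdf_numeric = cdf_y(y)
pdf_numeric = pdf_y(y)
print(cdf_numeric, y ** 1.5)              # compare with (6.130)
print(pdf_numeric, 1.5 * math.sqrt(y))    # compare with (6.131)
```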


In Example 6.16, we considered a strictly monotonically increasing function 
$f(x) = 3x^2$. This means that we could compute an inverse function. 
In general, we require that the function of interest y = U(x) has an inverse 
$x = U^{-1}(y)$. A useful result can be obtained by considering the cumulative 
distribution function $F_X(x)$ of a random variable X, and using 
it as the transformation U(x). This leads to the following theorem. 


Theorem 6.15. [Theorem 2.1.10 in Casella and Berger (2002)] Let X be a 
continuous random variable with a strictly monotonic cumulative distribution 
function $F_X(x)$. Then the random variable Y defined as 

$$Y := F_X(X) \tag{6.132}$$

has a uniform distribution. 

Theorem 6.15 is known as the probability integral transform, and it is 
used to derive algorithms for sampling from distributions by transforming 
the result of sampling from a uniform random variable (Bishop, 2006). 
The algorithm works by first generating a sample from a uniform distribu- 
tion, then transforming it by the inverse cdf (assuming this is available) 
to obtain a sample from the desired distribution. The probability integral 
transform is also used for hypothesis testing whether a sample comes from 
a particular distribution (Lehmann and Romano, 2005). The idea that the 
output of a cdf gives a uniform distribution also forms the basis of copu- 
las (Nelsen, 2006). 
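
The sampling algorithm described above can be sketched in a few lines. As an assumption for this illustration, we pick the exponential distribution with rate $\lambda$ (not otherwise used in this chapter), because its cdf $F(x) = 1 - \exp(-\lambda x)$ has the closed-form inverse $F^{-1}(u) = -\log(1 - u)/\lambda$; the function names and the seed are hypothetical choices.

```python
import math
import random


def exponential_cdf(x, rate):
    """cdf F(x) = 1 - exp(-rate * x) for x >= 0."""
    return 1.0 - math.exp(-rate * x)


def exponential_inverse_cdf(u, rate):
    """Inverse cdf F^{-1}(u) = -log(1 - u) / rate for 0 <= u < 1."""
    return -math.log(1.0 - u) / rate


rng = random.Random(0)  # fixed seed so that the run is reproducible
rate = 2.0              # illustrative rate; the mean of the target is 1 / rate

# Step 1: sample from the uniform distribution on [0, 1).
uniform_draws = [rng.random() for _ in range(100_000)]
# Step 2: transform by the inverse cdf to obtain samples from the target.
samples = [exponential_inverse_cdf(u, rate) for u in uniform_draws]
sample_mean = sum(samples) / len(samples)

# Probability integral transform: applying the cdf maps the samples
# back to the uniform draws they were generated from.
recovered = [exponential_cdf(x, rate) for x in samples]
print(sample_mean)
```

The sample mean should be close to 1/rate, and `recovered` reproduces the uniform draws up to floating-point error.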


6.7.2 Change of Variables 


The distribution function technique in Section 6.7.1 is derived from first 
principles, based on the definitions of cdfs and using properties of in- 
verses, differentiation, and integration. This argument from first principles 
relies on two facts: 


1. We can transform the cdf of Y into an expression that is a cdf of X. 
2. We can differentiate the cdf to obtain the pdf. 


Let us break down the reasoning step by step, with the goal of understand- 
ing the more general change-of-variables approach in Theorem 6.16. 


Remark. The name “change of variables” comes from the idea of changing 
the variable of integration when faced with a difficult integral. For 
univariate functions, we use the substitution rule of integration, 

$$\int f(g(x))\, g'(x)\, dx = \int f(u)\, du, \quad \text{where } u = g(x). \tag{6.133}$$

The derivation of this rule is based on the chain rule of calculus (5.32) and 
by applying twice the fundamental theorem of calculus. The fundamental 
theorem of calculus formalizes the fact that integration and differentiation 
are somehow “inverses” of each other. An intuitive understanding of the 
rule can be obtained by thinking (loosely) about small changes (differentials) 
to the equation u = g(x), that is, by considering $\Delta u = g'(x) \Delta x$ as a 
differential of u = g(x). By substituting u = g(x), the argument inside the 
integral on the right-hand side of (6.133) becomes f(g(x)). By pretending 
that the term du can be approximated by $du \approx \Delta u = g'(x) \Delta x$, and that 
$dx \approx \Delta x$, we obtain (6.133). 


Consider a univariate random variable X, and an invertible function 
U, which gives us another random variable Y = U(X). We assume that 
random variable X has states $x \in [a, b]$. By the definition of the cdf, we 
have 

$$F_Y(y) = P(Y \leq y). \tag{6.134}$$

We are interested in a function U of the random variable 

$$P(Y \leq y) = P(U(X) \leq y), \tag{6.135}$$

where we assume that the function U is invertible. An invertible function 
on an interval is either strictly increasing or strictly decreasing. In the case 
that U is strictly increasing, then its inverse $U^{-1}$ is also strictly increasing. 
By applying the inverse $U^{-1}$ to the arguments of $P(U(X) \leq y)$, we obtain 

$$P(U(X) \leq y) = P(U^{-1}(U(X)) \leq U^{-1}(y)) = P(X \leq U^{-1}(y)). \tag{6.136}$$
The right-most term in (6.136) is an expression of the cdf of X. Recall the 
definition of the cdf in terms of the pdf 

$$P(X \leq U^{-1}(y)) = \int_a^{U^{-1}(y)} f(x)\, dx. \tag{6.137}$$

Now we have an expression of the cdf of Y in terms of x: 

$$F_Y(y) = \int_a^{U^{-1}(y)} f(x)\, dx. \tag{6.138}$$

To obtain the pdf, we differentiate (6.138) with respect to y: 

$$f(y) = \frac{d}{dy} F_Y(y) = \frac{d}{dy} \int_a^{U^{-1}(y)} f(x)\, dx. \tag{6.139}$$

Note that the integral on the right-hand side is with respect to x, but we 
need an integral with respect to y because we are differentiating with 
respect to y. In particular, we use (6.133) to get the substitution 

$$\int f(U^{-1}(y))\, (U^{-1})'(y)\, dy = \int f(x)\, dx, \quad \text{where } x = U^{-1}(y). \tag{6.140}$$

Using (6.140) on the right-hand side of (6.139) gives us 

$$f(y) = \frac{d}{dy} \int_a^{U^{-1}(y)} f_x(U^{-1}(y))\, (U^{-1})'(y)\, dy. \tag{6.141}$$

We then recall that differentiation is a linear operator and we use the 
subscript x to remind ourselves that $f_x(U^{-1}(y))$ is a function of x and not 
y. Invoking the fundamental theorem of calculus again gives us 

$$f(y) = f_x(U^{-1}(y)) \cdot \left( \frac{d}{dy} U^{-1}(y) \right). \tag{6.142}$$

Recall that we assumed that U is a strictly increasing function. For decreasing 
functions, it turns out that we have a negative sign when we follow 
the same derivation. We introduce the absolute value of the differential to 
have the same expression for both increasing and decreasing U: 

$$f(y) = f_x(U^{-1}(y)) \cdot \left| \frac{d}{dy} U^{-1}(y) \right|. \tag{6.143}$$

This is called the change-of-variable technique. The term $\left| \frac{d}{dy} U^{-1}(y) \right|$ in 
(6.143) measures how much a unit volume changes when applying U 
(see also the definition of the Jacobian in Section 5.3). 
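
The univariate change-of-variable formula can be turned into a small generic routine. The sketch below is an illustration under stated assumptions: `change_of_variable_pdf` is a hypothetical helper that approximates the derivative of $U^{-1}$ by a central difference, and we reuse the density $3x^2$ on [0, 1] from Example 6.16 with one increasing map ($Y = X^2$) and one decreasing map ($Y = 1 - X$, our own choice) to show why the absolute value is needed.

```python
import math


def change_of_variable_pdf(pdf_x, u_inverse, y, eps=1e-6):
    """f(y) = f_x(U^{-1}(y)) * |d/dy U^{-1}(y)|, with a numerical derivative."""
    derivative = (u_inverse(y + eps) - u_inverse(y - eps)) / (2.0 * eps)
    return pdf_x(u_inverse(y)) * abs(derivative)


def pdf_x(x):
    """Density of X: 3 x^2 on [0, 1] and zero elsewhere."""
    return 3.0 * x * x if 0.0 <= x <= 1.0 else 0.0


# Increasing transformation Y = X^2, so U^{-1}(y) = sqrt(y).
increasing = change_of_variable_pdf(pdf_x, math.sqrt, 0.49)

# Decreasing transformation Y = 1 - X, so U^{-1}(y) = 1 - y and the
# derivative is -1; the absolute value keeps the density nonnegative.
decreasing = change_of_variable_pdf(pdf_x, lambda y: 1.0 - y, 0.2)

print(increasing, 1.5 * math.sqrt(0.49))  # matches (3/2) * y^(1/2)
print(decreasing, 3.0 * (1.0 - 0.2) ** 2)
```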


Remark. In comparison to the discrete case in (6.125b), we have an additional 
factor $\left| \frac{d}{dy} U^{-1}(y) \right|$. The continuous case requires more care because 
P(Y = y) = 0 for all y. The probability density function f(y) does not 
have a description as a probability of an event involving y. 


So far in this section, we have been studying univariate change of variables. 
The case for multivariate random variables is analogous, but complicated 
by the fact that the absolute value cannot be used for multivariate 
functions. Instead, we use the determinant of the Jacobian matrix. Recall 
from (5.58) that the Jacobian is a matrix of partial derivatives, and that 
the existence of a nonzero determinant shows that we can invert the Jacobian. 
Recall the discussion in Section 4.1 that the determinant arises 
because our differentials (cubes of volume) are transformed into parallelepipeds 
by the Jacobian. Let us summarize the preceding discussion in 
the following theorem, which gives us a recipe for multivariate change of 
variables. 


Theorem 6.16. [Theorem 17.2 in Billingsley (1995)] Let f(x) be the value 
of the probability density of the multivariate continuous random variable X. 
If the vector-valued function y = U(x) is differentiable and invertible for 
all values within the domain of x, then for corresponding values of y, the 
probability density of Y = U(X) is given by 

$$f(y) = f_x(U^{-1}(y)) \cdot \left| \det\left( \frac{\partial}{\partial y} U^{-1}(y) \right) \right|. \tag{6.144}$$


The theorem looks intimidating at first glance, but the key point is that 
a change of variable of a multivariate random variable follows the pro- 
cedure of the univariate change of variable. First we need to work out 
the inverse transform, and substitute that into the density of x. Then we 
calculate the determinant of the Jacobian and multiply the result. The 
following example illustrates the case of a bivariate random variable. 


Example 6.17 

Consider a bivariate random variable X with states $x = \begin{bmatrix} x_1 \\ x_2 \end{bmatrix}$ and probability 
density function 

$$f\left( \begin{bmatrix} x_1 \\ x_2 \end{bmatrix} \right) = \frac{1}{2\pi} \exp\left( -\frac{1}{2} \begin{bmatrix} x_1 \\ x_2 \end{bmatrix}^\top \begin{bmatrix} x_1 \\ x_2 \end{bmatrix} \right). \tag{6.145}$$

We use the change-of-variable technique from Theorem 6.16 to derive the 
effect of a linear transformation (Section 2.7) of the random variable. 
Consider a matrix $A \in \mathbb{R}^{2 \times 2}$ defined as 

$$A = \begin{bmatrix} a & b \\ c & d \end{bmatrix}. \tag{6.146}$$

We are interested in finding the probability density function of the transformed 
bivariate random variable Y with states y = Ax. 

Recall that for change of variables we require the inverse transformation 
of x as a function of y. Since we consider linear transformations, the 
inverse transformation is given by the matrix inverse (see Section 2.2.2). 
For 2 × 2 matrices, we can explicitly write out the formula, given by 

$$\begin{bmatrix} x_1 \\ x_2 \end{bmatrix} = A^{-1} \begin{bmatrix} y_1 \\ y_2 \end{bmatrix} = \frac{1}{ad - bc} \begin{bmatrix} d & -b \\ -c & a \end{bmatrix} \begin{bmatrix} y_1 \\ y_2 \end{bmatrix}. \tag{6.147}$$

Observe that ad − bc is the determinant (Section 4.1) of A. The corresponding 
probability density function is given by 

$$f(x) = f(A^{-1} y) = \frac{1}{2\pi} \exp\left( -\frac{1}{2} y^\top A^{-\top} A^{-1} y \right). \tag{6.148}$$


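The partial derivative of a matrix times a vector with respect to the vector 
is the matrix itself (Section 5.5), and therefore 

$$\frac{\partial}{\partial y} A^{-1} y = A^{-1}. \tag{6.149}$$

Recall from Section 4.1 that the determinant of the inverse is the inverse 
of the determinant so that the determinant of the Jacobian matrix is 

$$\det\left( \frac{\partial}{\partial y} A^{-1} y \right) = \frac{1}{ad - bc}. \tag{6.150}$$

We are now able to apply the change-of-variable formula from Theorem 6.16 
by multiplying (6.148) with the absolute value of (6.150), which yields 

$$f(y) = f(x) \left| \det\left( \frac{\partial}{\partial y} A^{-1} y \right) \right| \tag{6.151a}$$
$$= \frac{1}{2\pi} \exp\left( -\frac{1}{2} y^\top A^{-\top} A^{-1} y \right) |ad - bc|^{-1}. \tag{6.151b}$$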


While Example 6.17 is based on a bivariate random variable, which al- 
lows us to easily compute the matrix inverse, the preceding relation holds 
for higher dimensions. 


Remark. We saw in Section 6.5 that the density f(x) in (6.148) is actually 
the standard Gaussian distribution, and the transformed density f(y) is a 
bivariate Gaussian with covariance $\Sigma = A A^\top$. 
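
The result of Example 6.17 and the remark above can be cross-checked numerically. The sketch below is our own illustration with hypothetical function names and an arbitrarily chosen invertible matrix A: it evaluates the density obtained from the change-of-variable formula and compares it with a zero-mean bivariate Gaussian density whose covariance is $A A^\top$, both written out explicitly for the 2 × 2 case.

```python
import math


def std_normal_2d(x1, x2):
    """Standard bivariate Gaussian density, cf. (6.145)."""
    return math.exp(-0.5 * (x1 * x1 + x2 * x2)) / (2.0 * math.pi)


def transformed_density(y1, y2, a, b, c, d):
    """Change of variables for y = A x: f(A^{-1} y) * |ad - bc|^{-1}."""
    det = a * d - b * c
    x1 = (d * y1 - b * y2) / det   # first row of A^{-1} y
    x2 = (-c * y1 + a * y2) / det  # second row of A^{-1} y
    return std_normal_2d(x1, x2) / abs(det)


def gaussian_2d(y1, y2, s11, s12, s22):
    """Zero-mean bivariate Gaussian density with covariance [[s11, s12], [s12, s22]]."""
    det = s11 * s22 - s12 * s12
    quad = (s22 * y1 * y1 - 2.0 * s12 * y1 * y2 + s11 * y2 * y2) / det
    return math.exp(-0.5 * quad) / (2.0 * math.pi * math.sqrt(det))


a, b, c, d = 2.0, 1.0, 0.5, 3.0  # illustrative invertible matrix A
# Entries of Sigma = A A^T.
s11, s12, s22 = a * a + b * b, a * c + b * d, c * c + d * d

for y1, y2 in [(0.0, 0.0), (1.0, -2.0), (-0.5, 3.0)]:
    print(transformed_density(y1, y2, a, b, c, d), gaussian_2d(y1, y2, s11, s12, s22))
```

Both columns of the printed output should agree up to floating-point error.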


We will use the ideas in this chapter to describe probabilistic modeling 
in Section 8.4, as well as introduce a graphical language in Section 8.5. We 
will see direct machine learning applications of these ideas in Chapters 9 
and 11. 


6.8 Further Reading 


This chapter is rather terse at times. Grinstead and Snell (1997) and 
Walpole et al. (2011) provide more relaxed presentations that are suit- 
able for self-study. Readers interested in more philosophical aspects of 
probability should consider Hacking (2001), whereas an approach that 
is more related to software engineering is presented by Downey (2014). 
An overview of exponential families can be found in Barndorff-Nielsen 
(2014). We will see more about how to use probability distributions to 
model machine learning tasks in Chapter 8. Ironically, the recent surge 
in interest in neural networks has resulted in a broader appreciation of 
probabilistic models. For example, the idea of normalizing flows (Jimenez 
Rezende and Mohamed, 2015) relies on change of variables for transform- 
ing random variables. An overview of methods for variational inference as 
applied to neural networks is described in chapters 16 to 20 of the book 
by Goodfellow et al. (2016). 

We sidestepped a large part of the difficulty in continuous random variables 
by avoiding measure-theoretic questions (Billingsley, 1995; Pollard, 
2002), and by assuming without construction that we have real numbers, 
and ways of defining sets on real numbers as well as their appropriate 
frequency of occurrence. These details do matter, for example, in the 
specification of conditional probability p(y | x) for continuous random variables 
x, y (Proschan and Presnell, 1998). The lazy notation hides the fact that 
we want to specify that X = x (which is a set of measure zero). Furthermore, 
we are interested in the probability density function of y. A 
more precise notation would have to say $\mathbb{E}_y[f(y) \mid \sigma(x)]$, where we take 
the expectation over y of a test function f conditioned on the $\sigma$-algebra of 
x. A more technical audience interested in the details of probability theory 
has many options (Jaynes, 2003; MacKay, 2003; Jacod and Protter, 
2004; Grimmett and Welsh, 2014), including some very technical discussions 
(Shiryayev, 1984; Lehmann and Casella, 1998; Dudley, 2002; Bickel 
and Doksum, 2006; Çinlar, 2011). An alternative way to approach probability 
is to start with the concept of expectation, and “work backward” to 
derive the necessary properties of a probability space (Whittle, 2000). As 
machine learning allows us to model more intricate distributions on ever 
more complex types of data, a developer of probabilistic machine learn- 
ing models would have to understand these more technical aspects. Ma- 
chine learning texts with a probabilistic modeling focus include the books 
by MacKay (2003); Bishop (2006); Rasmussen and Williams (2006); Barber 
(2012); Murphy (2012). 


Exercises 


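6.1 Consider the following bivariate distribution p(x, y) of two discrete random 
variables X and Y. 

[Table: joint probabilities p(x, y), with columns $x_1, \ldots, x_5$ for the states of X and rows $y_1, y_2, y_3$ for the states of Y; the numerical entries are not recoverable here.] 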

Compute: 


a. The marginal distributions p(x) and p(y). 
b. The conditional distributions p(2|Y = y1) and p(y|X = z3). 


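6.2 Consider a mixture of two Gaussian distributions (illustrated in Figure 6.4), 

[Equation: weighted sum of two two-dimensional Gaussian densities; the mixture weights, means, and covariance matrices are not recoverable here.] 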


a. Compute the marginal distributions for each dimension. 
b. Compute the mean, mode and median for each marginal distribution. 
c. Compute the mean and mode for the two-dimensional distribution. 


6.3 You have written a computer program that sometimes compiles and sometimes 
does not (the code does not change). You decide to model the apparent 
stochasticity (success vs. no success) x of the compiler using a Bernoulli distribution 
with parameter $\mu$: 

$$p(x \mid \mu) = \mu^{x} (1 - \mu)^{1 - x}, \quad x \in \{0, 1\}.$$

Choose a conjugate prior for the Bernoulli likelihood and compute the posterior 
distribution $p(\mu \mid x_1, \ldots, x_N)$. 

6.4 There are two bags. The first bag contains four mangos and two apples; the 
second bag contains four mangos and four apples. 
We also have a biased coin, which shows “heads” with probability 0.6 and 
“tails” with probability 0.4. If the coin shows “heads”, we pick a fruit at
random from bag 1; otherwise we pick a fruit at random from bag 2. 
Your friend flips the coin (you cannot see the result), picks a fruit at random 
from the corresponding bag, and presents you a mango. 
What is the probability that the mango was picked from bag 2? 
Hint: Use Bayes’ theorem. 

6.5 Consider the time-series model

$$x_{t+1} = A x_t + w, \quad w \sim \mathcal{N}(0, Q)$$
$$y_t = C x_t + v, \quad v \sim \mathcal{N}(0, R),$$

where $w$, $v$ are i.i.d. Gaussian noise variables. Further, assume that $p(x_0) = \mathcal{N}(\mu_0, \Sigma_0)$.
a. What is the form of $p(x_0, x_1, \ldots, x_T)$? Justify your answer (you do not
have to explicitly compute the joint distribution).
b. Assume that $p(x_t \mid y_1, \ldots, y_t) = \mathcal{N}(\mu_t, \Sigma_t)$.

1. Compute $p(x_{t+1} \mid y_1, \ldots, y_t)$.
2. Compute $p(x_{t+1}, y_{t+1} \mid y_1, \ldots, y_t)$.
3. At time $t+1$, we observe the value $y_{t+1} = \hat{y}$. Compute the conditional
distribution $p(x_{t+1} \mid y_1, \ldots, y_{t+1})$.

6.6 Prove the relationship in (6.44), which relates the standard definition of the
variance to the raw-score expression for the variance.

6.7 Prove the relationship in (6.45), which relates the pairwise difference between
examples in a dataset with the raw-score expression for the variance.

6.8 Express the Bernoulli distribution in the natural parameter form of the exponential
family, see (6.107).

6.9 Express the Binomial distribution as an exponential family distribution. Also
express the Beta distribution as an exponential family distribution. Show that
the product of the Beta and the Binomial distribution is also a member of
the exponential family.

6.10 Derive the relationship in Section 6.5.2 in two ways:

a. By completing the square
b. By expressing the Gaussian in its exponential family form

The product of two Gaussians $\mathcal{N}(x \mid a, A)\,\mathcal{N}(x \mid b, B)$ is an unnormalized
Gaussian distribution $c\,\mathcal{N}(x \mid \boldsymbol{c}, C)$ with

$$C = (A^{-1} + B^{-1})^{-1}$$
$$\boldsymbol{c} = C(A^{-1} a + B^{-1} b)$$
$$c = (2\pi)^{-\frac{D}{2}} \, |A + B|^{-\frac{1}{2}} \exp\left(-\tfrac{1}{2}(a - b)^\top (A + B)^{-1} (a - b)\right).$$

Note that the normalizing constant $c$ itself can be considered a (normalized)
Gaussian distribution either in $a$ or in $b$ with an “inflated” covariance matrix
$A + B$, i.e., $c = \mathcal{N}(a \mid b, A + B) = \mathcal{N}(b \mid a, A + B)$.

6.11 Iterated Expectations.

Consider two random variables $x$, $y$ with joint distribution $p(x, y)$. Show that

$$\mathbb{E}_X[x] = \mathbb{E}_Y\big[\mathbb{E}_X[x \mid y]\big].$$

Here, $\mathbb{E}_X[x \mid y]$ denotes the expected value of $x$ under the conditional distribution
$p(x \mid y)$.

6.12 Manipulation of Gaussian Random Variables.

Consider a Gaussian random variable $x \sim \mathcal{N}(x \mid \mu_x, \Sigma_x)$, where $x \in \mathbb{R}^D$.
Furthermore, we have

$$y = A x + b + w,$$

where $y \in \mathbb{R}^E$, $A \in \mathbb{R}^{E \times D}$, $b \in \mathbb{R}^E$, and $w \sim \mathcal{N}(w \mid 0, Q)$ is independent
Gaussian noise. “Independent” implies that $x$ and $w$ are independent
random variables and that $Q$ is diagonal.

a. Write down the likelihood $p(y \mid x)$.

b. The distribution $p(y) = \int p(y \mid x)\, p(x)\, \mathrm{d}x$ is Gaussian. Compute the mean
$\mu_y$ and the covariance $\Sigma_y$. Derive your result in detail.
c. The random variable $y$ is being transformed according to the measurement
mapping

$$z = C y + v,$$

where $z \in \mathbb{R}^F$, $C \in \mathbb{R}^{F \times E}$, and $v \sim \mathcal{N}(v \mid 0, R)$ is independent Gaussian
(measurement) noise.

= Write down $p(z \mid y)$.
= Compute $p(z)$, i.e., the mean $\mu_z$ and the covariance $\Sigma_z$. Derive your
result in detail.

d. Now, a value $\hat{y}$ is measured. Compute the posterior distribution $p(x \mid \hat{y})$.
Hint for solution: This posterior is also Gaussian, i.e., we need to determine
only its mean and covariance matrix. Start by explicitly computing
the joint Gaussian $p(x, y)$. This also requires us to compute the
cross-covariances $\mathrm{Cov}_{x,y}[x, y]$ and $\mathrm{Cov}_{y,x}[y, x]$. Then apply the rules
for Gaussian conditioning.


6.13 Probability Integral Transformation
Given a continuous random variable $X$, with cdf $F_X(x)$, show that the random
variable $Y := F_X(X)$ is uniformly distributed (Theorem 6.15).
Continuous Optimization


Since machine learning algorithms are implemented on a computer, the 
mathematical formulations are expressed as numerical optimization meth- 
ods. This chapter describes the basic numerical methods for training ma- 
chine learning models. Training a machine learning model often boils 
down to finding a good set of parameters. The notion of “good” is de- 
termined by the objective function or the probabilistic model, which we 
will see examples of in the second part of this book. Given an objective 
function, finding the best value is done using optimization algorithms. 

This chapter covers two main branches of continuous optimization (Fig- 
ure 7.1): unconstrained and constrained optimization. We will assume in 
this chapter that our objective function is differentiable (see Chapter 5), 
hence we have access to a gradient at each location in the space to help us 
find the optimum value. By convention, most objective functions in ma- 
chine learning are intended to be minimized, that is, the best value is the 
minimum value. Intuitively, finding the best value is like finding the valleys
of the objective function, and the gradients point us uphill. The idea is
to move downhill (opposite to the gradient) and hope to find the deepest 
point. For unconstrained optimization, this is the only concept we need, 
but there are several design choices, which we discuss in Section 7.1. For 
constrained optimization, we need to introduce other concepts to man- 
age the constraints (Section 7.2). We will also introduce a special class 
of problems (convex optimization problems in Section 7.3) where we can 
make statements about reaching the global optimum. 

Figure 7.1 A mind map of the concepts related to optimization, as presented in this chapter. There are two main ideas: gradient descent and convex optimization.

Since we consider data and models in $\mathbb{R}^D$, the optimization problems we face are continuous optimization problems, as opposed to combinatorial optimization problems for discrete variables.

Consider the function in Figure 7.2. The function has a global minimum
around $x = -4.5$, with a function value of approximately $-47$. Since
the function is “smooth,” the gradients can be used to help find the minimum
by indicating whether we should take a step to the right or left.
This assumes that we are in the correct bowl, as there exists another local
minimum around $x = 0.7$. Recall that we can solve for all the stationary
points of a function by calculating its derivative and setting it to zero. For

$$\ell(x) = x^4 + 7x^3 + 5x^2 - 17x + 3, \tag{7.1}$$

we obtain the corresponding gradient as

$$\frac{\mathrm{d}\ell(x)}{\mathrm{d}x} = 4x^3 + 21x^2 + 10x - 17. \tag{7.2}$$

Since this is a cubic equation, it has in general three solutions when set to
zero. In the example, two of them are minimums and one is a maximum
(around $x = -1.4$). To check whether a stationary point is a minimum
or maximum, we need to take the derivative a second time and check
whether the second derivative is positive or negative at the stationary
point. In our case, the second derivative is

$$\frac{\mathrm{d}^2\ell(x)}{\mathrm{d}x^2} = 12x^2 + 42x + 10. \tag{7.3}$$

By substituting our visually estimated values of $x = -4.5, -1.4, 0.7$, we
will observe that as expected the middle point is a maximum $\left(\frac{\mathrm{d}^2\ell(x)}{\mathrm{d}x^2} < 0\right)$
and the other two stationary points are minimums.

Stationary points are the real roots of the derivative, that is, points that have zero gradient.
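The second-derivative test can be checked numerically. The sketch below evaluates (7.2) and (7.3) at the three visually estimated points; the helper names `d1`, `d2`, and `classify` are illustrative choices, not notation from the text.

```python
def d1(x):
    """First derivative (7.2) of l(x) = x^4 + 7x^3 + 5x^2 - 17x + 3."""
    return 4 * x**3 + 21 * x**2 + 10 * x - 17


def d2(x):
    """Second derivative (7.3)."""
    return 12 * x**2 + 42 * x + 10


def classify(x):
    """Classify an (approximately) stationary point by the sign of the second derivative."""
    curvature = d2(x)
    if curvature > 0:
        return "minimum"
    if curvature < 0:
        return "maximum"
    return "inconclusive"


# The points are only visual estimates, so d1 is small there but not exactly zero.
for x in (-4.5, -1.4, 0.7):
    print(x, round(d1(x), 3), classify(x))
```

Because the three points are read off a plot, the first derivative is close to zero there rather than exactly zero; the sign of the second derivative is still unambiguous.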

Note that we have avoided analytically solving for values of $x$ in the
previous discussion, although for low-order polynomials such as the preceding
we could do so. In general, we are unable to find analytic solutions,
and hence we need to start at some value, say $x_0 = -6$, and follow
the negative gradient. The negative gradient indicates that we should go
right, but not how far (this is called the step-size). Furthermore, if we
had started at the right side (e.g., $x_0 = 0$) the negative gradient would
have led us to the wrong minimum. Figure 7.2 illustrates the fact that for
$x > -1$, the negative gradient points toward the minimum on the right of
the figure, which has a larger objective value.
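The dependence on the starting point can be reproduced with a few lines of Python. This is a minimal sketch: the constant step-size 0.005 and the 500 iterations are assumptions chosen so that the iteration is stable on this particular polynomial, and the function names are illustrative.

```python
def loss(x):
    """Objective (7.1)."""
    return x**4 + 7 * x**3 + 5 * x**2 - 17 * x + 3


def dloss(x):
    """Derivative (7.2)."""
    return 4 * x**3 + 21 * x**2 + 10 * x - 17


def gradient_descent_1d(x0, step_size=0.005, num_steps=500):
    """Repeatedly step against the derivative, starting from x0."""
    x = x0
    for _ in range(num_steps):
        x = x - step_size * dloss(x)
    return x


x_left = gradient_descent_1d(-6.0)   # starts left of the global minimum
x_right = gradient_descent_1d(0.0)   # starts in the other bowl
print(x_left, loss(x_left))
print(x_right, loss(x_right))
```

Both runs stop at a stationary point, but only the run started at $x_0 = -6$ reaches the global minimum; the run started at $x_0 = 0$ settles in the local minimum with the larger objective value.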

In Section 7.3, we will learn about a class of functions, called convex 
functions, that do not exhibit this tricky dependency on the starting point 
of the optimization algorithm. For convex functions, all local minimums
are global minimums. It turns out that many machine learning objective
functions are designed such that they are convex, and we will see an ex- 
ample in Chapter 12. 

The discussion in this chapter so far was about a one-dimensional func- 
tion, where we are able to visualize the ideas of gradients, descent direc- 
tions, and optimal values. In the rest of this chapter we develop the same 
ideas in high dimensions. Unfortunately, we can only visualize the concepts
in one dimension, and some concepts do not generalize directly to
higher dimensions, so some care needs to be taken when reading.


7.1 Optimization Using Gradient Descent 


We now consider the problem of solving for the minimum of a real-valued
function

$$\min_{x} f(x), \tag{7.4}$$

where $f : \mathbb{R}^d \to \mathbb{R}$ is an objective function that captures the machine
learning problem at hand. We assume that our function $f$ is differentiable,
and we are unable to analytically find a solution in closed form.

Figure 7.2 Example objective function. Negative gradients are indicated by arrows, and the global minimum is indicated by the dashed blue line.

According to the Abel-Ruffini theorem, there is in general no algebraic solution for polynomials of degree 5 or more (Abel, 1826).

For convex functions all local minima are global minima.

We use the convention of row vectors for gradients.

Gradient descent is a first-order optimization algorithm. To find a local 
minimum of a function using gradient descent, one takes steps propor- 
tional to the negative of the gradient of the function at the current point. 
Recall from Section 5.1 that the gradient points in the direction of the 
steepest ascent. Another useful intuition is to consider the set of lines 
where the function is at a certain value ($f(x) = c$ for some value $c \in \mathbb{R}$),
which are known as the contour lines. The gradient points in a direction 
that is orthogonal to the contour lines of the function we wish to optimize. 

Let us consider multivariate functions. Imagine a surface (described by
the function $f(x)$) with a ball starting at a particular location $x_0$. When
the ball is released, it will move downhill in the direction of steepest descent.
Gradient descent exploits the fact that $f(x_0)$ decreases fastest if one
moves from $x_0$ in the direction of the negative gradient $-((\nabla f)(x_0))^\top$ of
$f$ at $x_0$. We assume in this book that the functions are differentiable, and
refer the reader to more general settings in Section 7.4. Then, if

$$x_1 = x_0 - \gamma ((\nabla f)(x_0))^\top \tag{7.5}$$

for a small step-size $\gamma \geq 0$, then $f(x_1) \leq f(x_0)$. Note that we use the
transpose for the gradient since otherwise the dimensions will not work
out.

This observation allows us to define a simple gradient descent algorithm:
If we want to find a local optimum $f(x_*)$ of a function $f : \mathbb{R}^n \to \mathbb{R}$,
$x \mapsto f(x)$, we start with an initial guess $x_0$ of the parameters we wish
to optimize and then iterate according to

$$x_{i+1} = x_i - \gamma_i ((\nabla f)(x_i))^\top. \tag{7.6}$$

For suitable step-size $\gamma_i$, the sequence $f(x_0) \geq f(x_1) \geq \ldots$ converges to
a local minimum.


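Example 7.1
Consider a quadratic function in two dimensions

$$f\left(\begin{bmatrix} x_1 \\ x_2 \end{bmatrix}\right) = \frac{1}{2} \begin{bmatrix} x_1 \\ x_2 \end{bmatrix}^\top \begin{bmatrix} 2 & 1 \\ 1 & 20 \end{bmatrix} \begin{bmatrix} x_1 \\ x_2 \end{bmatrix} - \begin{bmatrix} 5 \\ 3 \end{bmatrix}^\top \begin{bmatrix} x_1 \\ x_2 \end{bmatrix} \tag{7.7}$$

with gradient

$$\nabla f\left(\begin{bmatrix} x_1 \\ x_2 \end{bmatrix}\right) = \begin{bmatrix} x_1 \\ x_2 \end{bmatrix}^\top \begin{bmatrix} 2 & 1 \\ 1 & 20 \end{bmatrix} - \begin{bmatrix} 5 \\ 3 \end{bmatrix}^\top. \tag{7.8}$$

Starting at the initial location $x_0 = [-3, -1]^\top$, we iteratively apply (7.6)
to obtain a sequence of estimates that converge to the minimum value
(illustrated in Figure 7.3). We can see (both from the figure and by plugging
$x_0$ into (7.8) with $\gamma = 0.085$) that the negative gradient at $x_0$ points
north and east, leading to $x_1 = [-1.98, 1.21]^\top$. Repeating that argument
gives us $x_2 = [-1.32, -0.42]^\top$, and so on.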
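The first two iterates of this example can be reproduced directly. The sketch below hard-codes the gradient of a quadratic with matrix entries 2, 1, 1, 20 and linear term 5, 3, which is the choice consistent with the iterates quoted in the example; the function names are illustrative.

```python
def grad_f(x):
    """Gradient of the quadratic, returned as a list [df/dx1, df/dx2]."""
    x1, x2 = x
    return [2 * x1 + x2 - 5, x1 + 20 * x2 - 3]


def gradient_step(x, step_size):
    """One application of the update (7.6) with a fixed step-size."""
    g = grad_f(x)
    return [x[0] - step_size * g[0], x[1] - step_size * g[1]]


x0 = [-3.0, -1.0]
x1 = gradient_step(x0, 0.085)
x2 = gradient_step(x1, 0.085)
print([round(v, 2) for v in x1])
print([round(v, 2) for v in x2])
```

The gradient at $x_0$ is $[-12, -26]$, so the negative gradient has two positive components, which is the "north and east" direction mentioned in the example.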


Remark. Gradient descent can be relatively slow close to the minimum: 
Its asymptotic rate of convergence is inferior to many other methods. Us- 
ing the ball rolling down the hill analogy, when the surface is a long, thin 
valley, the problem is poorly conditioned (Trefethen and Bau III, 1997). 
For poorly conditioned convex problems, gradient descent increasingly 
“zigzags” as the gradients point nearly orthogonally to the shortest direction
to a minimum point; see Figure 7.3. ♢


7.1.1 Step-size 


As mentioned earlier, choosing a good step-size is important in gradient 
descent. If the step-size is too small, gradient descent can be slow. If the 
step-size is chosen too large, gradient descent can overshoot, fail to con- 
verge, or even diverge. We will discuss the use of momentum in the next 
section. It is a method that smoothes out erratic behavior of gradient up- 
dates and dampens oscillations. 

Adaptive gradient methods rescale the step-size at each iteration, de- 
pending on local properties of the function. There are two simple heuris- 
tics (Toussaint, 2012): 


= When the function value increases after a gradient step, the step-size
was too large. Undo the step and decrease the step-size.

= When the function value decreases, the step could have been larger. Try
to increase the step-size.

Although the “undo” step seems to be a waste of resources, using this
heuristic guarantees monotonic convergence.
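The two heuristics can be sketched as follows. The shrink factor 0.5, the growth factor 1.1, the initial step-size, and the quadratic test function are all assumptions made for illustration; the text does not prescribe specific values.

```python
def f(x):
    """Quadratic test objective: 0.5 * x^T [[2, 1], [1, 20]] x - [5, 3]^T x."""
    x1, x2 = x
    return 0.5 * (2 * x1 * x1 + 2 * x1 * x2 + 20 * x2 * x2) - (5 * x1 + 3 * x2)


def grad_f(x):
    """Gradient of f as a list."""
    x1, x2 = x
    return [2 * x1 + x2 - 5, x1 + 20 * x2 - 3]


def adaptive_gradient_descent(x0, step_size=0.1, shrink=0.5, grow=1.1, num_steps=1000):
    """Gradient descent that undoes a step when f increases and grows the step otherwise."""
    x = list(x0)
    values = [f(x)]
    for _ in range(num_steps):
        g = grad_f(x)
        candidate = [xi - step_size * gi for xi, gi in zip(x, g)]
        if f(candidate) > values[-1]:
            # Function value increased: discard the candidate ("undo") and shrink.
            step_size *= shrink
        else:
            # Function value did not increase: accept and try a larger step next time.
            x = candidate
            values.append(f(candidate))
            step_size *= grow
    return x, values


x_final, history = adaptive_gradient_descent([-3.0, -1.0])
print(x_final, history[-1])
```

Only accepted steps are recorded in `history`, so the recorded function values never increase, which is the monotonic behavior the heuristic is designed to provide.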


Example 7.2 (Solving a Linear Equation System)
When we solve linear equations of the form $Ax = b$, in practice we solve
$Ax - b = 0$ approximately by finding $x_*$ that minimizes the squared error

$$\|Ax - b\|^2 = (Ax - b)^\top (Ax - b) \tag{7.9}$$

if we use the Euclidean norm. The gradient of (7.9) with respect to $x$ is

$$\nabla_x = 2(Ax - b)^\top A. \tag{7.10}$$





We can use this gradient directly in a gradient descent algorithm. How- 
ever, for this particular special case, it turns out that there is an analytic 
solution, which can be found by setting the gradient to zero. We will see 
more on solving squared error problems in Chapter 9. 
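As a small illustration, the following sketch applies gradient descent with the gradient $2(Ax - b)^\top A$ to a $2 \times 2$ system. The matrix, right-hand side, step-size, and iteration count are made-up values for the example, chosen so that the exact solution is $x = [0.8, 1.4]^\top$.

```python
A = [[2.0, 1.0], [1.0, 3.0]]  # illustrative system matrix
b = [3.0, 5.0]                # illustrative right-hand side


def residual(x):
    """Compute Ax - b."""
    return [A[i][0] * x[0] + A[i][1] * x[1] - b[i] for i in range(2)]


def gradient(x):
    """Entries of the row vector 2 (Ax - b)^T A."""
    r = residual(x)
    return [2 * (r[0] * A[0][j] + r[1] * A[1][j]) for j in range(2)]


def solve_by_gradient_descent(step_size=0.05, num_steps=500):
    """Minimize ||Ax - b||^2 starting from the origin."""
    x = [0.0, 0.0]
    for _ in range(num_steps):
        g = gradient(x)
        x = [x[j] - step_size * g[j] for j in range(2)]
    return x


x_gd = solve_by_gradient_descent()
print(x_gd, residual(x_gd))
```

For a system this small, solving the equations directly is of course preferable; the sketch only shows that the gradient iteration arrives at the same point.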


Remark. When applied to the solution of linear systems of equations $Ax = b$,
gradient descent may converge slowly. The speed of convergence of gradient
descent is dependent on the condition number $\kappa = \frac{\sigma(A)_{\max}}{\sigma(A)_{\min}}$, which
is the ratio of the maximum to the minimum singular value (Section 4.5)
of $A$. The condition number essentially measures the ratio of the most
curved direction versus the least curved direction, which corresponds to
our imagery that poorly conditioned problems are long, thin valleys: They
are very curved in one direction, but very flat in the other. Instead of directly
solving $Ax = b$, one could instead solve $P^{-1}(Ax - b) = 0$, where
$P$ is called the preconditioner. The goal is to design $P^{-1}$ such that $P^{-1}A$
has a better condition number, but at the same time $P^{-1}$ is easy to compute.
For further information on gradient descent, preconditioning, and
convergence we refer to Boyd and Vandenberghe (2004, chapter 9). ♢





7.1.2 Gradient Descent With Momentum 


As illustrated in Figure 7.3, the convergence of gradient descent may be 
very slow if the curvature of the optimization surface is such that there 
are regions that are poorly scaled. The curvature is such that the gradient 
descent steps hop between the walls of the valley and approach the
optimum in small steps. The proposed tweak to improve convergence is 
to give gradient descent some memory. 

Gradient descent with momentum (Rumelhart et al., 1986) is a method 
that introduces an additional term to remember what happened in the 
previous iteration. This memory dampens oscillations and smoothes out 
the gradient updates. Continuing the ball analogy, the momentum term 
emulates the phenomenon of a heavy ball that is reluctant to change directions.
The idea is to have a gradient update with memory to implement
a moving average. The momentum-based method remembers the update
$\Delta x_i$ at each iteration $i$ and determines the next update as a linear combination
of the current and previous gradients

$$\Delta x_i = x_i - x_{i-1} = \alpha \Delta x_{i-1} - \gamma_{i-1} ((\nabla f)(x_{i-1}))^\top, \tag{7.12}$$

where $\alpha \in [0, 1]$. Sometimes we will only know the gradient approximately.
In such cases, the momentum term is useful since it averages out
different noisy estimates of the gradient. One particularly useful way to 
obtain an approximate gradient is by using a stochastic approximation, 
which we discuss next. 
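The recursion in (7.12) can be sketched in a few lines. The test function is a two-dimensional quadratic with gradient entries $2x_1 + x_2 - 5$ and $x_1 + 20x_2 - 3$, and the values $\gamma = 0.085$ and $\alpha = 0.5$ are assumptions picked for illustration; the initial update is taken to be zero.

```python
def grad_f(x):
    """Gradient of the quadratic test function, as a list."""
    x1, x2 = x
    return [2 * x1 + x2 - 5, x1 + 20 * x2 - 3]


def momentum_descent(x0, step_size=0.085, alpha=0.5, num_steps=200):
    """Gradient descent with momentum, following the update recursion (7.12)."""
    x = list(x0)
    delta = [0.0, 0.0]  # previous update; zero before the first step (an assumption)
    for _ in range(num_steps):
        g = grad_f(x)
        # New update = alpha * previous update - step_size * current gradient.
        delta = [alpha * d - step_size * gi for d, gi in zip(delta, g)]
        x = [xi + d for xi, d in zip(x, delta)]
    return x


x_mom = momentum_descent([-3.0, -1.0])
print(x_mom)
```

Setting `alpha=0.0` recovers plain gradient descent, which makes it easy to compare the two trajectories on the same problem.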


7.1.3 Stochastic Gradient Descent 


Computing the gradient can be very time consuming. However, often it is 
possible to find a “cheap” approximation of the gradient. Approximating 
the gradient is still useful as long as it points in roughly the same direction 
as the true gradient. 

Stochastic gradient descent (often shortened as SGD) is a stochastic ap- 
proximation of the gradient descent method for minimizing an objective 
function that is written as a sum of differentiable functions. The word 
stochastic here refers to the fact that we acknowledge that we do not 
know the gradient precisely, but instead only know a noisy approxima- 
tion to it. By constraining the probability distribution of the approximate 
gradients, we can still theoretically guarantee that SGD will converge. 

In machine learning, given $n = 1, \ldots, N$ data points, we often consider
objective functions that are the sum of the losses $L_n$ incurred by each
example $n$. In mathematical notation, we have the form

$$L(\theta) = \sum_{n=1}^{N} L_n(\theta), \tag{7.13}$$

where $\theta$ is the vector of parameters of interest, i.e., we want to find $\theta$ that
minimizes $L$. An example from regression (Chapter 9) is the negative log-likelihood,
which is expressed as a sum over log-likelihoods of individual
examples so that

$$L(\theta) = -\sum_{n=1}^{N} \log p(y_n \mid x_n, \theta), \tag{7.14}$$

where $x_n \in \mathbb{R}^D$ are the training inputs, $y_n$ are the training targets, and $\theta$
are the parameters of the regression model.

Standard gradient descent, as introduced previously, is a “batch” optimization
method, i.e., optimization is performed using the full training set
by updating the vector of parameters according to

$$\theta_{i+1} = \theta_i - \gamma_i (\nabla L(\theta_i))^\top = \theta_i - \gamma_i \sum_{n=1}^{N} (\nabla L_n(\theta_i))^\top \tag{7.15}$$

for a suitable step-size parameter $\gamma_i$. Evaluating the sum gradient may require
expensive evaluations of the gradients from all individual functions
$L_n$. When the training set is enormous and/or no simple formulas exist,
evaluating the sums of gradients becomes very expensive.

Consider the term $\sum_{n=1}^{N} (\nabla L_n(\theta_i))$ in (7.15). We can reduce the amount
of computation by taking a sum over a smaller set of $L_n$. In contrast to
batch gradient descent, which uses all $L_n$ for $n = 1, \ldots, N$, we randomly
choose a subset of $L_n$ for mini-batch gradient descent. In the extreme
case, we randomly select only a single $L_n$ to estimate the gradient. The
key insight about why taking a subset of data is sensible is to realize that
for gradient descent to converge, we only require that the gradient is an
unbiased estimate of the true gradient. In fact the term $\sum_{n=1}^{N} (\nabla L_n(\theta_i))$
in (7.15) is an empirical estimate of the expected value (Section 6.4.1) of
the gradient. Therefore, any other unbiased empirical estimate of the expected
value, for example using any subsample of the data, would suffice
for convergence of gradient descent.


Remark. When the learning rate decreases at an appropriate rate, and subject
to relatively mild assumptions, stochastic gradient descent converges
almost surely to a local minimum (Bottou, 1998). ♢
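A minimal mini-batch sketch for a one-parameter regression problem is given below. Everything specific in it is an assumption for illustration: the synthetic data (true slope 3, noise level 0.1), the squared per-example loss $L_n(\theta) = (y_n - \theta x_n)^2$, the batch size, and the decreasing step-size schedule. The mini-batch gradient is averaged rather than summed, which only rescales the step-size.

```python
import random


def make_data(rng, n=200, true_slope=3.0, noise=0.1):
    """Synthetic 1-D regression data y = true_slope * x + Gaussian noise."""
    xs = [rng.uniform(-1.0, 1.0) for _ in range(n)]
    ys = [true_slope * x + rng.gauss(0.0, noise) for x in xs]
    return xs, ys


def minibatch_gradient(theta, xs, ys, batch):
    """Average gradient of L_n(theta) = (y_n - theta * x_n)^2 over the batch indices."""
    return sum(-2.0 * xs[n] * (ys[n] - theta * xs[n]) for n in batch) / len(batch)


def sgd(xs, ys, rng, batch_size=10, num_steps=500):
    """Stochastic gradient descent with a randomly drawn mini-batch at every step."""
    theta = 0.0
    for i in range(num_steps):
        batch = rng.sample(range(len(xs)), batch_size)
        step_size = 0.5 / (1.0 + i / 100.0)  # slowly decreasing learning rate
        theta -= step_size * minibatch_gradient(theta, xs, ys, batch)
    return theta


rng = random.Random(0)  # fixed seed so the run is repeatable
xs, ys = make_data(rng)
theta_hat = sgd(xs, ys, rng)
print(theta_hat)
```

Each step touches only 10 of the 200 examples, yet the estimate ends up close to the slope used to generate the data, since every mini-batch gradient is an unbiased estimate of the full average gradient.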


Why should one consider using an approximate gradient? A major rea- 
son is practical implementation constraints, such as the size of central 
processing unit (CPU)/graphics processing unit (GPU) memory or limits 
on computational time. We can think of the size of the subset used to esti- 
mate the gradient in the same way that we thought of the size of a sample 
when estimating empirical means (Section 6.4.1). Large mini-batch sizes 
will provide accurate estimates of the gradient, reducing the variance in 
the parameter update. Furthermore, large mini-batches take advantage of 
highly optimized matrix operations in vectorized implementations of the 
cost and gradient. The reduction in variance leads to more stable conver- 
gence, but each gradient calculation will be more expensive. 

In contrast, small mini-batches are quick to estimate. If we keep the 
mini-batch size small, the noise in our gradient estimate will allow us to 
get out of some bad local optima, which we may otherwise get stuck in. 
In machine learning, optimization methods are used for training by min- 
imizing an objective function on the training data, but the overall goal 
is to improve generalization performance (Chapter 8). Since the goal in 
machine learning does not necessarily need a precise estimate of the min- 
imum of the objective function, approximate gradients using mini-batch 
approaches have been widely used. Stochastic gradient descent is very 
effective in large-scale machine learning problems (Bottou et al., 2018), 
such as training deep neural networks on millions of images (Dean et al.,
2012), topic models (Hoffman et al., 2013), reinforcement learning (Mnih 
et al., 2015), or training of large-scale Gaussian process models (Hensman 
et al., 2013; Gal et al., 2014). 


7.2 Constrained Optimization and Lagrange Multipliers 


In the previous section, we considered the problem of solving for the minimum
of a function

$$\min_{x} f(x), \tag{7.16}$$

where $f : \mathbb{R}^D \to \mathbb{R}$.

In this section, we have additional constraints. That is, for real-valued
functions $g_i : \mathbb{R}^D \to \mathbb{R}$ for $i = 1, \ldots, m$, we consider the constrained
optimization problem (see Figure 7.4 for an illustration)

$$\begin{aligned} \min_{x} \quad & f(x) \\ \text{subject to} \quad & g_i(x) \leq 0 \quad \text{for all} \quad i = 1, \ldots, m. \end{aligned} \tag{7.17}$$

It is worth pointing out that the functions $f$ and $g_i$ could be non-convex
in general, and we will consider the convex case in the next section.

One obvious, but not very practical, way of converting the constrained
problem (7.17) into an unconstrained one is to use an indicator function

$$J(x) = f(x) + \sum_{i=1}^{m} \mathbf{1}(g_i(x)), \tag{7.18}$$

where $\mathbf{1}(z)$ is an infinite step function

$$\mathbf{1}(z) = \begin{cases} 0 & \text{if } z \leq 0 \\ \infty & \text{otherwise} \end{cases}. \tag{7.19}$$


This gives infinite penalty if the constraint is not satisfied, and hence 
would provide the same solution. However, this infinite step function is 
equally difficult to optimize. We can overcome this difficulty by introduc- 
ing Lagrange multipliers. The idea of Lagrange multipliers is to replace the 
step function with a linear function. 

We associate to problem (7.17) the Lagrangian by introducing the Lagrange
multipliers $\lambda_i \geq 0$ corresponding to each inequality constraint respectively
(Boyd and Vandenberghe, 2004, chapter 4) so that

$$\mathfrak{L}(x, \lambda) = f(x) + \sum_{i=1}^{m} \lambda_i g_i(x) \tag{7.20a}$$
$$= f(x) + \lambda^\top g(x), \tag{7.20b}$$

where in the last line we have concatenated all constraints $g_i(x)$ into a
vector $g(x)$, and all the Lagrange multipliers into a vector $\lambda \in \mathbb{R}^m$.

We now introduce the idea of Lagrangian duality. In general, duality 
in optimization is the idea of converting an optimization problem in one 
set of variables $x$ (called the primal variables), into another optimization
problem in a different set of variables $\lambda$ (called the dual variables). We
introduce two different approaches to duality: In this section, we discuss 
Lagrangian duality; in Section 7.3.3, we discuss Legendre-Fenchel duality. 


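Definition 7.1. The problem in (7.17)

$$\begin{aligned} \min_{x} \quad & f(x) \\ \text{subject to} \quad & g_i(x) \leq 0 \quad \text{for all} \quad i = 1, \ldots, m \end{aligned} \tag{7.21}$$

is known as the primal problem, corresponding to the primal variables $x$.
The associated Lagrangian dual problem is given by

$$\begin{aligned} \max_{\lambda \in \mathbb{R}^m} \quad & \mathfrak{D}(\lambda) \\ \text{subject to} \quad & \lambda \geq 0, \end{aligned} \tag{7.22}$$

where $\lambda$ are the dual variables and $\mathfrak{D}(\lambda) = \min_{x \in \mathbb{R}^d} \mathfrak{L}(x, \lambda)$.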


Remark. In the discussion of Definition 7.1, we use two concepts that are
also of independent interest (Boyd and Vandenberghe, 2004).

First is the minimax inequality, which says that for any function with
two arguments $\varphi(x, y)$, the maximin is less than or equal to the minimax, i.e.,

$$\max_{y} \min_{x} \varphi(x, y) \leq \min_{x} \max_{y} \varphi(x, y). \tag{7.23}$$

This inequality can be proved by considering the inequality

$$\text{For all } x, y \qquad \min_{x} \varphi(x, y) \leq \max_{y} \varphi(x, y). \tag{7.24}$$

Note that taking the maximum over $y$ of the left-hand side of (7.24) maintains
the inequality since the inequality is true for all $y$. Similarly, we can
take the minimum over $x$ of the right-hand side of (7.24) to obtain (7.23).

The second concept is weak duality, which uses (7.23) to show that
primal values are always greater than or equal to dual values. This is described
in more detail in (7.27). ♢
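For a function of two arguments on a finite grid, the minimax inequality can be checked by brute force. The table of values below is an arbitrary example chosen so that the inequality is strict; `phi[x][y]` plays the role of the two-argument function.

```python
# phi[x][y]: an arbitrary two-argument function tabulated on a 2 x 2 grid.
phi = [
    [1, 4],
    [3, 2],
]

num_x = len(phi)
num_y = len(phi[0])

# Left-hand side of the minimax inequality: max over y of (min over x).
maximin = max(min(phi[x][y] for x in range(num_x)) for y in range(num_y))

# Right-hand side: min over x of (max over y).
minimax = min(max(phi[x][y] for y in range(num_y)) for x in range(num_x))

print(maximin, minimax)
```

Here the inner minima over $x$ are 1 and 2, so the maximin is 2, while the inner maxima over $y$ are 4 and 3, so the minimax is 3.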


Recall that the difference between $J(x)$ in (7.18) and the Lagrangian
in (7.20b) is that we have relaxed the indicator function to a linear function.
Therefore, when $\lambda \geq 0$, the Lagrangian $\mathfrak{L}(x, \lambda)$ is a lower bound of
$J(x)$. Hence, the maximum of $\mathfrak{L}(x, \lambda)$ with respect to $\lambda$ is

$$J(x) = \max_{\lambda \geq 0} \mathfrak{L}(x, \lambda). \tag{7.25}$$

Recall that the original problem was minimizing $J(x)$,

$$\min_{x \in \mathbb{R}^d} \max_{\lambda \geq 0} \mathfrak{L}(x, \lambda). \tag{7.26}$$

By the minimax inequality (7.23), it follows that swapping the order of
the minimum and maximum results in a smaller value, i.e.,

$$\min_{x \in \mathbb{R}^d} \max_{\lambda \geq 0} \mathfrak{L}(x, \lambda) \geq \max_{\lambda \geq 0} \min_{x \in \mathbb{R}^d} \mathfrak{L}(x, \lambda). \tag{7.27}$$

This is also known as weak duality. Note that the inner part of the right-hand
side is the dual objective function $\mathfrak{D}(\lambda)$ and the definition follows.

In contrast to the original optimization problem, which has constraints,
$\min_{x \in \mathbb{R}^d} \mathfrak{L}(x, \lambda)$ is an unconstrained optimization problem for a given
value of $\lambda$. If solving $\min_{x \in \mathbb{R}^d} \mathfrak{L}(x, \lambda)$ is easy, then the overall problem is
easy to solve. We can see this by observing from (7.20b) that $\mathfrak{L}(x, \lambda)$ is
affine with respect to $\lambda$. Therefore $\min_{x \in \mathbb{R}^d} \mathfrak{L}(x, \lambda)$ is a pointwise minimum
of affine functions of $\lambda$, and hence $\mathfrak{D}(\lambda)$ is concave even though
$f(\cdot)$ and $g_i(\cdot)$ may be nonconvex. The outer problem, maximization over
$\lambda$, is the maximum of a concave function and can be efficiently computed.

Assuming $f(\cdot)$ and $g_i(\cdot)$ are differentiable, we find the Lagrange dual
problem by differentiating the Lagrangian with respect to $x$, setting the
differential to zero, and solving for the optimal value. We will discuss two
concrete examples in Sections 7.3.1 and 7.3.2, where $f(\cdot)$ and $g_i(\cdot)$ are
convex.
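A one-dimensional toy problem makes the primal and dual problems concrete. The problem below is an assumed example, not taken from the text: minimize $x^2$ subject to $1 - x \leq 0$. Differentiating the Lagrangian $x^2 + \lambda(1 - x)$ with respect to $x$ and setting the result to zero gives $x = \lambda/2$, which the sketch uses to evaluate the dual function; the dual is then maximized over a grid of non-negative $\lambda$ values.

```python
def primal_objective(x):
    """Objective of the toy problem: f(x) = x^2."""
    return x * x


def dual_function(lam):
    """D(lam) = min_x [x^2 + lam * (1 - x)], using the stationary point x = lam / 2."""
    x_star = lam / 2.0
    return x_star * x_star + lam * (1.0 - x_star)


# Grid over the dual variable, restricted to lam >= 0.
lambdas = [i / 100.0 for i in range(0, 401)]
best_lam = max(lambdas, key=dual_function)
dual_opt = dual_function(best_lam)

# The constraint 1 - x <= 0 means x >= 1, so the primal optimum is at x = 1.
primal_opt = primal_objective(1.0)

print(best_lam, dual_opt, primal_opt)
```

Every dual value on the grid is at most the primal optimum, as weak duality requires, and for this convex toy problem the best dual value coincides with the primal optimum.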


Remark (Equality Constraints). Consider (7.17) with additional equality
constraints

$$\begin{aligned} \min_{x} \quad & f(x) \\ \text{subject to} \quad & g_i(x) \leq 0 \quad \text{for all} \quad i = 1, \ldots, m \\ & h_j(x) = 0 \quad \text{for all} \quad j = 1, \ldots, n. \end{aligned} \tag{7.28}$$

We can model equality constraints by replacing them with two inequality
constraints. That is, for each equality constraint $h_j(x) = 0$ we equivalently
replace it by two constraints $h_j(x) \leq 0$ and $h_j(x) \geq 0$. It turns out that
the resulting Lagrange multipliers are then unconstrained.

Therefore, we constrain the Lagrange multipliers corresponding to the
inequality constraints in (7.28) to be non-negative, and leave the Lagrange
multipliers corresponding to the equality constraints unconstrained. ♢


7.3 Convex Optimization 


We focus our attention on a particularly useful class of optimization problems,
where we can guarantee global optimality. When $f(\cdot)$ is a convex
function, and when the constraints involving $g(\cdot)$ and $h(\cdot)$ are convex sets,
this is called a convex optimization problem. In this setting, we have strong
duality: The optimal solution of the dual problem is the same as the optimal
solution of the primal problem. The distinction between convex functions
and convex sets is often not strictly presented in machine learning
literature, but one can often infer the implied meaning from context.


Definition 7.2. A set $\mathcal{C}$ is a convex set if for any $x, y \in \mathcal{C}$ and for any scalar
$\theta$ with $0 \leq \theta \leq 1$, we have

$$\theta x + (1 - \theta) y \in \mathcal{C}. \tag{7.29}$$

Convex sets are sets such that a straight line connecting any two elements
of the set lies inside the set. Figures 7.5 and 7.6 illustrate convex
and nonconvex sets, respectively.

Convex functions are functions such that a straight line between any
two points of the function lies above the function. Figure 7.2 shows a nonconvex
function, and Figure 7.3 shows a convex function. Another convex
function is shown in Figure 7.7.


Definition 7.3. Let function $f : \mathbb{R}^D \to \mathbb{R}$ be a function whose domain is a
convex set. The function $f$ is a convex function if for all $x, y$ in the domain
of $f$, and for any scalar $\theta$ with $0 \leq \theta \leq 1$, we have

$$f(\theta x + (1 - \theta) y) \leq \theta f(x) + (1 - \theta) f(y). \tag{7.30}$$

Remark. A concave function is the negative of a convex function. ♢
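Inequality (7.30) can be probed numerically by sampling random pairs of points and random values of $\theta$. The sketch below is only a counterexample search: finding a violation shows that a function is not convex on the sampled interval, while finding none does not prove convexity. The interval, the number of trials, and the two test functions are assumptions for illustration.

```python
import random


def violates_convexity(f, lo, hi, rng, trials=2000, tol=1e-9):
    """Return True if a sampled triple (x, y, theta) violates inequality (7.30)."""
    for _ in range(trials):
        x = rng.uniform(lo, hi)
        y = rng.uniform(lo, hi)
        theta = rng.uniform(0.0, 1.0)
        lhs = f(theta * x + (1.0 - theta) * y)
        rhs = theta * f(x) + (1.0 - theta) * f(y)
        if lhs > rhs + tol:
            return True
    return False


def square(x):
    """A convex function."""
    return x * x


def quartic(x):
    """A polynomial with two separate local minima, hence nonconvex."""
    return x**4 + 7 * x**3 + 5 * x**2 - 17 * x + 3


rng = random.Random(0)
print(violates_convexity(square, -6.0, 2.0, rng))
print(violates_convexity(quartic, -6.0, 2.0, rng))
```

The quartic has negative second derivative on part of the interval, so sampled chords there fall below the function and the search reports a violation.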


The constraints involving $g(\cdot)$ and $h(\cdot)$ in (7.28) truncate functions at a
scalar value, resulting in sets. Another relation between convex functions 
and convex sets is to consider the set obtained by “filling in” a convex 
function. A convex function is a bowl-like object, and we imagine pouring 
water into it to fill it up. This resulting filled-in set, called the epigraph of 
the convex function, is a convex set. 

If a function $f : \mathbb{R}^n \to \mathbb{R}$ is differentiable, we can specify convexity in
terms of its gradient $\nabla_x f(x)$ (Section 5.2). A function $f(x)$ is convex if
and only if for any two points $x, y$ it holds that

$$f(y) \geq f(x) + \nabla_x f(x)^\top (y - x). \tag{7.31}$$

If we further know that a function $f(x)$ is twice differentiable, that is, the
Hessian (5.147) exists for all values in the domain of $x$, then the function
$f(x)$ is convex if and only if $\nabla_x^2 f(x)$ is positive semidefinite (Boyd and
Vandenberghe, 2004).


Example 7.3

The negative entropy $f(x) = x \log_2 x$ is convex for $x > 0$. A visualization
of the function is shown in Figure 7.8, and we can see that the function is
convex. To illustrate the previous definitions of convexity, let us check the
calculations for two points $x = 2$ and $x = 4$. Note that to prove convexity
of $f(x)$ we would need to check for all points $x \in \mathbb{R}$.

Recall Definition 7.3. Consider a point midway between the two points
(that is $\theta = 0.5$); then the left-hand side is $f(0.5 \cdot 2 + 0.5 \cdot 4) = 3 \log_2 3 \approx 4.75$.
The right-hand side is $0.5(2 \log_2 2) + 0.5(4 \log_2 4) = 1 + 4 = 5$. And
therefore the definition is satisfied.

Since $f(x)$ is differentiable, we can alternatively use (7.31). Calculating
the derivative of $f(x)$, we obtain

$$\nabla_x (x \log_2 x) = 1 \cdot \log_2 x + x \cdot \frac{1}{x \log_e 2} = \log_2 x + \frac{1}{\log_e 2}. \tag{7.32}$$

Using the same two test points $x = 2$ and $x = 4$, the left-hand side of
(7.31) is given by $f(4) = 8$. The right-hand side is

$$f(x) + \nabla_x^\top (y - x) = f(2) + \nabla f(2) \cdot (4 - 2) \tag{7.33a}$$
$$= 2 + \left(1 + \frac{1}{\log_e 2}\right) \cdot 2 \approx 6.9. \tag{7.33b}$$
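The numbers in this example are easy to confirm with the standard `math` module. The sketch evaluates both sides of the definition (7.30) at $\theta = 0.5$ and both sides of the first-order condition (7.31) for the test points 2 and 4; the variable names are illustrative.

```python
import math


def f(x):
    """Negative entropy f(x) = x * log2(x), defined for x > 0."""
    return x * math.log2(x)


def df(x):
    """Derivative of f: log2(x) + 1 / ln(2)."""
    return math.log2(x) + 1.0 / math.log(2.0)


x, y, theta = 2.0, 4.0, 0.5

# Definition of convexity: f(theta x + (1 - theta) y) <= theta f(x) + (1 - theta) f(y).
lhs_def = f(theta * x + (1.0 - theta) * y)
rhs_def = theta * f(x) + (1.0 - theta) * f(y)

# First-order condition: f(y) >= f(x) + f'(x) (y - x).
lhs_first = f(y)
rhs_first = f(x) + df(x) * (y - x)

print(lhs_def, rhs_def)
print(lhs_first, rhs_first)
```

Both checks hold for this pair of points, which is consistent with convexity but, as the example notes, is not a proof of it.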
We can check that a function or set is convex from first principles by
recalling the definitions. In practice, we often rely on operations that pre- 
serve convexity to check that a particular function or set is convex. Al- 
though the details are vastly different, this is again the idea of closure 
that we introduced in Chapter 2 for vector spaces. 


Example 7.4 
A nonnegative weighted sum of convex functions is convex. Observe that
if $f$ is a convex function, and $\alpha \geq 0$ is a nonnegative scalar, then the
function $\alpha f$ is convex. We can see this by multiplying both sides of the
inequality in Definition 7.3 by $\alpha$, and recalling that multiplying by a
nonnegative number does not change the inequality.

If $f_1$ and $f_2$ are convex functions, then we have by the definition

$$f_1(\theta x + (1 - \theta) y) \leq \theta f_1(x) + (1 - \theta) f_1(y) \tag{7.34}$$

$$f_2(\theta x + (1 - \theta) y) \leq \theta f_2(x) + (1 - \theta) f_2(y). \tag{7.35}$$

Summing up both sides gives us

$$f_1(\theta x + (1 - \theta) y) + f_2(\theta x + (1 - \theta) y) \leq \theta f_1(x) + (1 - \theta) f_1(y) + \theta f_2(x) + (1 - \theta) f_2(y), \tag{7.36}$$

where the right-hand side can be rearranged to

$$\theta (f_1(x) + f_2(x)) + (1 - \theta)(f_1(y) + f_2(y)), \tag{7.37}$$

completing the proof that the sum of convex functions is convex.

Combining the preceding two facts, we see that $\alpha f_1(x) + \beta f_2(x)$ is
convex for $\alpha, \beta \geq 0$. This closure property can be extended using a
similar argument for nonnegative weighted sums of more than two convex
functions.
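The closure property can be spot-checked on a grid. The sketch below is an illustration under assumptions: the functions $x^2$ and $|x|$, the weights 2 and 3, and the grid are arbitrary choices, and a grid check is evidence rather than a proof. It also shows that a negative weight can break convexity.

```python
def f1(x):
    return x * x      # convex

def f2(x):
    return abs(x)     # convex, not differentiable at 0

alpha, beta = 2.0, 3.0  # hypothetical nonnegative weights

def h(x):
    """Nonnegative weighted sum alpha*f1 + beta*f2."""
    return alpha * f1(x) + beta * f2(x)

points = [-2.0 + 0.5 * k for k in range(9)]   # grid on [-2, 2]
thetas = [0.1 * k for k in range(11)]         # grid on [0, 1]

def max_violation(func):
    """Largest value of func(t*x + (1-t)*y) - (t*func(x) + (1-t)*func(y)) on the grid.

    For a convex function this is never positive (up to rounding error).
    """
    return max(
        func(t * x + (1 - t) * y) - (t * func(x) + (1 - t) * func(y))
        for x in points for y in points for t in thetas
    )

violation_sum = max_violation(h)
violation_neg = max_violation(lambda x: -1.0 * f1(x))  # negative weight

print(violation_sum, violation_neg)
```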
Remark. The inequality in (7.30) is sometimes called Jensen’s inequality.
In fact, a whole class of inequalities for taking nonnegative weighted sums
of convex functions are all called Jensen’s inequality. ♢


In summary, a constrained optimization problem is called a convex
optimization problem if

$$\begin{aligned} \min_x \quad & f(x) \\ \text{subject to} \quad & g_i(x) \leq 0 \quad \text{for all } i = 1, \dots, m \\ & h_j(x) = 0 \quad \text{for all } j = 1, \dots, n, \end{aligned} \tag{7.38}$$

where all functions $f(x)$ and $g_i(x)$ are convex functions, and all $h_j(x) = 0$
are convex sets. In the following, we will describe two classes of convex
optimization problems that are widely used and well understood.


7.3.1 Linear Programming 


Consider the special case when all the preceding functions are linear, i.e.,

$$\begin{aligned} \min_{x \in \mathbb{R}^d} \quad & c^\top x \\ \text{subject to} \quad & Ax \leq b, \end{aligned} \tag{7.39}$$

where $A \in \mathbb{R}^{m \times d}$ and $b \in \mathbb{R}^m$. This is known as a linear program. It has $d$
variables and $m$ linear constraints. The Lagrangian is given by

$$\mathfrak{L}(x, \lambda) = c^\top x + \lambda^\top (Ax - b), \tag{7.40}$$

where $\lambda \in \mathbb{R}^m$ is the vector of non-negative Lagrange multipliers.
Rearranging the terms corresponding to $x$ yields

$$\mathfrak{L}(x, \lambda) = (c + A^\top \lambda)^\top x - \lambda^\top b. \tag{7.41}$$

Taking the derivative of $\mathfrak{L}(x, \lambda)$ with respect to $x$ and setting it to zero
gives us

$$c + A^\top \lambda = 0. \tag{7.42}$$

Therefore, the dual Lagrangian is $\mathfrak{D}(\lambda) = -\lambda^\top b$. Recall we would like
to maximize $\mathfrak{D}(\lambda)$. In addition to the constraint due to the derivative of
$\mathfrak{L}(x, \lambda)$ being zero, we also have the fact that $\lambda \geq 0$, resulting in the
following dual optimization problem

$$\begin{aligned} \max_{\lambda \in \mathbb{R}^m} \quad & -b^\top \lambda \\ \text{subject to} \quad & c + A^\top \lambda = 0 \\ & \lambda \geq 0. \end{aligned} \tag{7.43}$$

This is also a linear program, but with $m$ variables. We have the choice
of solving the primal (7.39) or the dual (7.43) program depending on
whether $m$ or $d$ is larger. Recall that $d$ is the number of variables and $m$ is
the number of constraints in the primal linear program.


Example 7.5 (Linear Program) 
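Consider the linear program

$$\begin{aligned} \min_{x \in \mathbb{R}^2} \quad & -\begin{bmatrix} 5 \\ 3 \end{bmatrix}^\top \begin{bmatrix} x_1 \\ x_2 \end{bmatrix} \\ \text{subject to} \quad & \begin{bmatrix} 2 & 2 \\ 2 & -4 \\ -2 & 1 \\ 0 & -1 \\ 0 & 1 \end{bmatrix} \begin{bmatrix} x_1 \\ x_2 \end{bmatrix} \leq \begin{bmatrix} 33 \\ 8 \\ 5 \\ -1 \\ 8 \end{bmatrix} \end{aligned} \tag{7.44}$$

with two variables. This program is also shown in Figure 7.9. The objective
function is linear, resulting in linear contour lines. The constraint set in
standard form is translated into the legend. The optimal value must lie in
the shaded (feasible) region, and is indicated by the star.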
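For a two-variable program such as (7.44), the optimum can be found by brute force, because a bounded linear program attains its optimum at a vertex of the feasible polygon. The following sketch enumerates the vertices and then evaluates the dual objective (7.43) with multipliers recovered from the two active constraints. Vertex enumeration is used here only for illustration and is not a method proposed in the text; the helper names are hypothetical, and the dual step assumes exactly two constraints are active at the optimum.

```python
from itertools import combinations

# Data of Example 7.5: minimize c^T x subject to A x <= b.
c = [-5.0, -3.0]
A = [[2.0, 2.0], [2.0, -4.0], [-2.0, 1.0], [0.0, -1.0], [0.0, 1.0]]
b = [33.0, 8.0, 5.0, -1.0, 8.0]

def intersect(i, j):
    """Intersection of the boundaries of constraints i and j (Cramer's rule)."""
    (a11, a12), (a21, a22) = A[i], A[j]
    det = a11 * a22 - a12 * a21
    if abs(det) < 1e-12:
        return None  # parallel boundaries
    return [(b[i] * a22 - a12 * b[j]) / det, (a11 * b[j] - a21 * b[i]) / det]

def feasible(x, tol=1e-9):
    return all(r[0] * x[0] + r[1] * x[1] <= bi + tol for r, bi in zip(A, b))

def objective(x):
    return c[0] * x[0] + c[1] * x[1]

# Enumerate feasible vertices and keep the best one.
vertices = []
for i, j in combinations(range(len(A)), 2):
    x = intersect(i, j)
    if x is not None and feasible(x):
        vertices.append(x)
x_star = min(vertices, key=objective)
primal_value = objective(x_star)

# Constraints that hold with equality at x_star (assumed to be exactly two).
active = [i for i, (r, bi) in enumerate(zip(A, b))
          if abs(r[0] * x_star[0] + r[1] * x_star[1] - bi) < 1e-9]
i, j = active

# Solve c + A^T lam = 0 from (7.42), with lam zero on inactive constraints.
det = A[i][0] * A[j][1] - A[j][0] * A[i][1]
lam = [0.0] * len(A)
lam[i] = (-c[0] * A[j][1] + A[j][0] * c[1]) / det
lam[j] = (-A[i][0] * c[1] + A[i][1] * c[0]) / det
dual_value = -sum(bi * li for bi, li in zip(b, lam))

print(x_star, primal_value, lam, dual_value)
```

If the sketch behaves as intended, the primal and dual values coincide, which is the strong duality property of linear programs.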








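[Figure 7.9: the linear program of Example 7.5; the legend lists the constraints, e.g., $2x_2 \leq 33 - 2x_1$, $4x_2 \geq 2x_1 - 8$, and $x_2 \leq 2x_1 + 5$.]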
7.3.2 Quadratic Programming


Consider the case of a convex quadratic objective function, where the
constraints are affine, i.e.,

$$\begin{aligned} \min_{x \in \mathbb{R}^d} \quad & \frac{1}{2} x^\top Q x + c^\top x \\ \text{subject to} \quad & Ax \leq b, \end{aligned} \tag{7.45}$$

where $A \in \mathbb{R}^{m \times d}$, $b \in \mathbb{R}^m$, and $c \in \mathbb{R}^d$. The square symmetric matrix
$Q \in \mathbb{R}^{d \times d}$ is positive definite, and therefore the objective function is convex.
This is known as a quadratic program. Observe that it has $d$ variables and
$m$ linear constraints.


Example 7.6 (Quadratic Program) 
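Consider the quadratic program

$$\min_{x \in \mathbb{R}^2} \quad \frac{1}{2} \begin{bmatrix} x_1 \\ x_2 \end{bmatrix}^\top \begin{bmatrix} 2 & 1 \\ 1 & 4 \end{bmatrix} \begin{bmatrix} x_1 \\ x_2 \end{bmatrix} + \begin{bmatrix} 5 \\ 3 \end{bmatrix}^\top \begin{bmatrix} x_1 \\ x_2 \end{bmatrix} \tag{7.46}$$

$$\text{subject to} \quad \begin{bmatrix} 1 & 0 \\ -1 & 0 \\ 0 & 1 \\ 0 & -1 \end{bmatrix} \begin{bmatrix} x_1 \\ x_2 \end{bmatrix} \leq \begin{bmatrix} 1 \\ 1 \\ 1 \\ 1 \end{bmatrix} \tag{7.47}$$

of two variables. The program is also illustrated in Figure 7.4. The objective
function is quadratic with a positive semidefinite matrix $Q$, resulting
in elliptical contour lines. The optimal value must lie in the shaded
(feasible) region, and is indicated by the star.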


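The Lagrangian is given by

$$\mathfrak{L}(x, \lambda) = \frac{1}{2} x^\top Q x + c^\top x + \lambda^\top (Ax - b) \tag{7.48a}$$

$$= \frac{1}{2} x^\top Q x + (c + A^\top \lambda)^\top x - \lambda^\top b, \tag{7.48b}$$

where again we have rearranged the terms. Taking the derivative of $\mathfrak{L}(x, \lambda)$
with respect to $x$ and setting it to zero gives

$$Qx + (c + A^\top \lambda) = 0. \tag{7.49}$$

Assuming that $Q$ is invertible, we get

$$x = -Q^{-1}(c + A^\top \lambda). \tag{7.50}$$

Substituting (7.50) into the primal Lagrangian $\mathfrak{L}(x, \lambda)$, we get the dual
Lagrangian

$$\mathfrak{D}(\lambda) = -\frac{1}{2} (c + A^\top \lambda)^\top Q^{-1} (c + A^\top \lambda) - \lambda^\top b. \tag{7.51}$$

Therefore, the dual optimization problem is given by

$$\begin{aligned} \max_{\lambda \in \mathbb{R}^m} \quad & -\frac{1}{2} (c + A^\top \lambda)^\top Q^{-1} (c + A^\top \lambda) - \lambda^\top b \\ \text{subject to} \quad & \lambda \geq 0. \end{aligned} \tag{7.52}$$

We will see an application of quadratic programming in machine learning
in Chapter 12.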
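The dual (7.52) only has the simple constraint $\lambda \geq 0$, so it can be solved approximately by projected gradient ascent, after which the primal solution is recovered from (7.50). The sketch below does this for the quadratic program of Example 7.6. Projected gradient ascent, the step size 0.5, and the iteration count are assumptions made for illustration; the text does not prescribe a solver.

```python
# Data of Example 7.6: minimize 0.5 x^T Q x + c^T x subject to A x <= b.
Q = [[2.0, 1.0], [1.0, 4.0]]
c = [5.0, 3.0]
A = [[1.0, 0.0], [-1.0, 0.0], [0.0, 1.0], [0.0, -1.0]]
b = [1.0, 1.0, 1.0, 1.0]

def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))

def matvec(M, v):
    return [dot(row, v) for row in M]

# Inverse of the 2x2 matrix Q.
det_Q = Q[0][0] * Q[1][1] - Q[0][1] * Q[1][0]
Q_inv = [[Q[1][1] / det_Q, -Q[0][1] / det_Q],
         [-Q[1][0] / det_Q, Q[0][0] / det_Q]]

def shifted_cost(lam):
    """Return c + A^T lam."""
    return [c[k] + sum(A[i][k] * lam[i] for i in range(len(A)))
            for k in range(len(c))]

def x_of(lam):
    """Minimizer of the Lagrangian for fixed lam, equation (7.50)."""
    return [-w for w in matvec(Q_inv, shifted_cost(lam))]

def dual(lam):
    """Dual Lagrangian (7.51)."""
    v = shifted_cost(lam)
    return -0.5 * dot(v, matvec(Q_inv, v)) - dot(lam, b)

def primal(x):
    return 0.5 * dot(x, matvec(Q, x)) + dot(c, x)

# Projected gradient ascent: the gradient of (7.51) in lam is A x(lam) - b,
# and projecting onto lam >= 0 clips negative entries to zero.
lam = [0.0] * len(A)
step = 0.5
for _ in range(200):
    residual = [ax - bi for ax, bi in zip(matvec(A, x_of(lam)), b)]
    lam = [max(0.0, li + step * ri) for li, ri in zip(lam, residual)]

x_star = x_of(lam)
print(x_star, lam, primal(x_star), dual(lam))
```

Only the multiplier of the constraint $-x_1 \leq 1$ should end up nonzero, reflecting that this is the single active constraint at the optimum.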


7.3.3 Legendre-Fenchel Transform and Convex Conjugate 


Let us revisit the idea of duality from Section 7.2, without considering 
constraints. One useful fact about a convex set is that it can be equiva- 
lently described by its supporting hyperplanes. A hyperplane is called a 
supporting hyperplane of a convex set if it intersects the convex set, and 
the convex set is contained on just one side of it. Recall that we can fill up 
a convex function to obtain the epigraph, which is a convex set. Therefore, 
we can also describe convex functions in terms of their supporting
hyperplanes. Furthermore, observe that the supporting hyperplane just touches
the convex function, and is in fact the tangent to the function at that
point. And recall that the tangent of a function $f(x)$ at a given point $x_0$
is the evaluation of the gradient of that function at that point,
$\left.\frac{\mathrm{d} f(x)}{\mathrm{d} x}\right|_{x = x_0}$.
In summary, because convex sets can be equivalently described by their
supporting hyperplanes, convex functions can be equivalently described by a
function of their gradient. The Legendre transform formalizes this concept.





We begin with the most general definition, which unfortunately has a 
counter-intuitive form, and look at special cases to relate the definition to 
the intuition described in the preceding paragraph. The Legendre-Fenchel 
transform is a transformation (in the sense of a Fourier transform) from 
a convex differentiable function $f(x)$ to a function that depends on the
tangents $s(x) = \nabla_x f(x)$. It is worth stressing that this is a transformation
of the function $f(\cdot)$ and not the variable $x$ or the function evaluated at $x$.
The Legendre-Fenchel transform is also known as the convex conjugate (for 
reasons we will see soon) and is closely related to duality (Hiriart-Urruty 
and Lemaréchal, 2001, chapter 5). 


Definition 7.4. The convex conjugate of a function $f : \mathbb{R}^D \to \mathbb{R}$ is a
function $f^*$ defined by

$$f^*(s) = \sup_{x \in \mathbb{R}^D} \left( \langle s, x \rangle - f(x) \right). \tag{7.53}$$


Note that the preceding convex conjugate definition does not need the
function $f$ to be convex nor differentiable. In Definition 7.4, we have used
a general inner product (Section 3.2) but in the rest of this section we
will consider the standard dot product between finite-dimensional vectors
($\langle s, x \rangle = s^\top x$) to avoid too many technical details.

To understand Definition 7.4 in a geometric fashion, consider a nice
simple one-dimensional convex and differentiable function, for example
$f(x) = x^2$. Note that since we are looking at a one-dimensional problem,
hyperplanes reduce to a line. Consider a line $y = sx + c$. Recall that we are
able to describe convex functions by their supporting hyperplanes, so let
us try to describe this function $f(x)$ by its supporting lines. Fix the
gradient of the line $s \in \mathbb{R}$ and for each point $(x_0, f(x_0))$ on the graph of $f$, find
the minimum value of $c$ such that the line still intersects $(x_0, f(x_0))$. Note
that the minimum value of $c$ is the place where a line with slope $s$ “just
touches” the function $f(x) = x^2$. The line passing through $(x_0, f(x_0))$
with gradient $s$ is given by

$$y - f(x_0) = s(x - x_0). \tag{7.54}$$

The $y$-intercept of this line is $-s x_0 + f(x_0)$. The minimum of $c$ for which
$y = sx + c$ intersects with the graph of $f$ is therefore

$$\inf_{x_0} \; -s x_0 + f(x_0). \tag{7.55}$$

The preceding convex conjugate is by convention defined to be the negative
of this. The reasoning in this paragraph did not rely on the fact that
we chose a one-dimensional convex and differentiable function, and holds
for $f : \mathbb{R}^D \to \mathbb{R}$, which are nonconvex and non-differentiable.
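Definition 7.4 can be explored numerically by replacing the supremum with a maximum over a finite grid. The sketch below does this for $f(x) = x^2$, whose conjugate has the closed form $f^*(s) = s^2/4$ (the supremum of $sx - x^2$ is attained at $x = s/2$). The grid range and spacing are arbitrary illustrative choices, and the grid maximum only approximates the supremum when the maximizer lies inside the grid.

```python
def f(x):
    return x * x

def conjugate_on_grid(func, s, grid):
    """Approximate f*(s) = sup_x (s*x - f(x)) by a maximum over a finite grid."""
    return max(s * x - func(x) for x in grid)

# Grid on [-10, 10] with spacing 0.01.
grid = [-10.0 + 0.01 * k for k in range(2001)]

results = []
for s in (-3.0, 0.0, 1.0, 2.0):
    approx = conjugate_on_grid(f, s, grid)
    exact = s * s / 4.0  # closed form for f(x) = x^2
    results.append((s, approx, exact))
    print(s, approx, exact)
```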


Remark. Convex differentiable functions such as the example $f(x) = x^2$ are
a nice special case, where there is no need for the supremum, and there is
a one-to-one correspondence between a function and its Legendre transform.
Let us derive this from first principles. For a convex differentiable
function, we know that at $x_0$ the tangent touches $f(x_0)$ so that

$$f(x_0) = s x_0 + c. \tag{7.56}$$

Recall that we want to describe the convex function $f(x)$ in terms of its
gradient $\nabla_x f(x)$, and that $s = \nabla_x f(x_0)$. We rearrange to get an
expression for $-c$ to obtain

$$-c = s x_0 - f(x_0). \tag{7.57}$$

Note that $-c$ changes with $x_0$ and therefore with $s$, which is why we can
think of it as a function of $s$, which we call

$$f^*(s) := s x_0 - f(x_0). \tag{7.58}$$

Comparing (7.58) with Definition 7.4, we see that (7.58) is a special case
(without the supremum). ♢


The conjugate function has nice properties; for example, for convex
functions, applying the Legendre transform again gets us back to the
original function. In the same way that the slope of $f(x)$ is $s$, the slope of $f^*(s)$
is $x$. (The classical Legendre transform is defined on convex differentiable
functions in $\mathbb{R}^D$.) The following two examples show common uses of convex
conjugates in machine learning.


Example 7.7 (Convex Conjugates) 
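To illustrate the application of convex conjugates, consider the quadratic
function

$$f(y) = \frac{\lambda}{2} y^\top K^{-1} y \tag{7.59}$$

based on a positive definite matrix $K \in \mathbb{R}^{n \times n}$. We denote the primal
variable to be $y \in \mathbb{R}^n$ and the dual variable to be $\alpha \in \mathbb{R}^n$.
Applying Definition 7.4, we obtain the function

$$f^*(\alpha) = \sup_{y \in \mathbb{R}^n} \; \langle y, \alpha \rangle - \frac{\lambda}{2} y^\top K^{-1} y. \tag{7.60}$$

Since the function is differentiable, we can find the maximum by taking
the derivative with respect to $y$ and setting it to zero.

$$\frac{\partial \left[ \langle y, \alpha \rangle - \frac{\lambda}{2} y^\top K^{-1} y \right]}{\partial y} = (\alpha - \lambda K^{-1} y)^\top \tag{7.61}$$

and hence when the gradient is zero we have $y = \frac{1}{\lambda} K \alpha$. Substituting
into (7.60) yields

$$f^*(\alpha) = \frac{1}{\lambda} \alpha^\top K \alpha - \frac{\lambda}{2} \left( \frac{1}{\lambda} K \alpha \right)^\top K^{-1} \left( \frac{1}{\lambda} K \alpha \right) = \frac{1}{2\lambda} \alpha^\top K \alpha. \tag{7.62}$$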
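The closed form (7.62) can be checked on a small instance. In the sketch below, the values of $\lambda$, $K$, and $\alpha$ are hypothetical, chosen only so that the arithmetic is easy to follow. The code evaluates the expression inside the supremum in (7.60) at the stationary point $y = \frac{1}{\lambda} K \alpha$ and at nearby perturbed points.

```python
lam = 0.5                      # hypothetical value of the scalar lambda
K = [[2.0, 1.0], [1.0, 3.0]]   # hypothetical positive definite matrix
alpha = [1.0, -2.0]            # hypothetical dual variable

def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))

def matvec(M, v):
    return [dot(row, v) for row in M]

# Inverse of the 2x2 matrix K.
det_K = K[0][0] * K[1][1] - K[0][1] * K[1][0]
K_inv = [[K[1][1] / det_K, -K[0][1] / det_K],
         [-K[1][0] / det_K, K[0][0] / det_K]]

def objective(y):
    """<y, alpha> - (lam/2) y^T K^{-1} y, the expression inside the sup in (7.60)."""
    return dot(y, alpha) - 0.5 * lam * dot(y, matvec(K_inv, y))

# Stationary point y = K alpha / lam and closed form (7.62).
y_star = [v / lam for v in matvec(K, alpha)]
closed_form = dot(alpha, matvec(K, alpha)) / (2.0 * lam)

# Moving away from y_star should never increase the objective.
perturbed = [objective([y_star[0] + dx, y_star[1] + dy])
             for dx in (-0.1, 0.0, 0.1) for dy in (-0.1, 0.0, 0.1)]

print(objective(y_star), closed_form, max(perturbed))
```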


Example 7.8 

In machine learning, we often use sums of functions; for example, the
objective function of the training set includes a sum of the losses for each
example in the training set. In the following, we derive the convex conjugate
of a sum of losses $\ell(t)$, where $\ell : \mathbb{R} \to \mathbb{R}$. This also illustrates the
application of the convex conjugate to the vector case. Let $\mathcal{L}(t) = \sum_{i=1}^{n} \ell_i(t_i)$.
Then,

$$\mathcal{L}^*(z) = \sup_{t \in \mathbb{R}^n} \; \langle z, t \rangle - \sum_{i=1}^{n} \ell_i(t_i) \tag{7.63a}$$

$$= \sup_{t \in \mathbb{R}^n} \; \sum_{i=1}^{n} z_i t_i - \ell_i(t_i) \quad \text{definition of dot product} \tag{7.63b}$$

$$= \sum_{i=1}^{n} \sup_{t \in \mathbb{R}^n} \; z_i t_i - \ell_i(t_i) \tag{7.63c}$$

$$= \sum_{i=1}^{n} \ell_i^*(z_i). \quad \text{definition of conjugate} \tag{7.63d}$$

Recall that in Section 7.2 we derived a dual optimization problem using 
Lagrange multipliers. Furthermore, for convex optimization problems we
have strong duality, that is, the solutions of the primal and dual problem
match. The Legendre-Fenchel transform described here also can be used
to derive a dual optimization problem. Furthermore, when the function 
is convex and differentiable, the supremum is unique. To further investi- 
gate the relation between these two approaches, let us consider a linear 
equality constrained convex optimization problem. 


Example 7.9 
Let $f(y)$ and $g(x)$ be convex functions, and $A$ a real matrix of appropriate
dimensions such that $Ax = y$. Then

$$\min_{x} f(Ax) + g(x) = \min_{Ax = y} f(y) + g(x). \tag{7.64}$$

By introducing the Lagrange multiplier $u$ for the constraints $Ax = y$,

$$\min_{Ax = y} f(y) + g(x) = \min_{x, y} \max_{u} \; f(y) + g(x) + (Ax - y)^\top u \tag{7.65a}$$

$$= \max_{u} \min_{x, y} \; f(y) + g(x) + (Ax - y)^\top u, \tag{7.65b}$$

where the last step of swapping max and min is due to the fact that $f(y)$
and $g(x)$ are convex functions. By splitting up the dot product term and
collecting $x$ and $y$,

$$\max_{u} \min_{x, y} \; f(y) + g(x) + (Ax - y)^\top u \tag{7.66a}$$

$$= \max_{u} \left[ \min_{y} -y^\top u + f(y) \right] + \left[ \min_{x} (Ax)^\top u + g(x) \right] \tag{7.66b}$$

$$= \max_{u} \left[ \min_{y} -y^\top u + f(y) \right] + \left[ \min_{x} x^\top A^\top u + g(x) \right]. \tag{7.66c}$$

Recall the convex conjugate (Definition 7.4) and the fact that dot
products are symmetric,

$$\max_{u} \left[ \min_{y} -y^\top u + f(y) \right] + \left[ \min_{x} x^\top A^\top u + g(x) \right] \tag{7.67a}$$

$$= \max_{u} \; -f^*(u) - g^*(-A^\top u). \tag{7.67b}$$

Therefore, we have shown that

$$\min_{x} f(Ax) + g(x) = \max_{u} \; -f^*(u) - g^*(-A^\top u). \tag{7.68}$$
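The identity (7.68) can be checked numerically in the scalar case. In this sketch, $f(y) = \frac{1}{2} y^2$, $g(x) = \frac{1}{2}(x - 1)^2$, and the $1 \times 1$ matrix $A = 2$ are hypothetical choices, and all minima, maxima, and suprema are approximated on a finite grid, so the two sides agree only up to grid error.

```python
def f(y):
    return 0.5 * y * y             # hypothetical convex function

def g(x):
    return 0.5 * (x - 1.0) ** 2    # hypothetical convex function

a = 2.0  # the 1x1 "matrix" A

# Grid on [-4, 4] with spacing 0.01, used for x, y, and u alike.
grid = [-4.0 + 0.01 * k for k in range(801)]

def conjugate(func, s):
    """Grid approximation of the convex conjugate (Definition 7.4)."""
    return max(s * x - func(x) for x in grid)

# Left-hand side of (7.68): min_x f(Ax) + g(x).
primal = min(f(a * x) + g(x) for x in grid)

# Right-hand side of (7.68): max_u -f*(u) - g*(-A^T u).
dual = max(-conjugate(f, u) - conjugate(g, -a * u) for u in grid)

print(primal, dual)
```

For these choices both sides can also be worked out by hand: the primal minimum is at $x = 0.2$ and the dual maximum at $u = 0.4$, each with value $0.4$.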
The Legendre-Fenchel conjugate turns out to be quite useful for machine
learning problems that can be expressed as convex optimization
problems. In particular, for convex loss functions that apply independently
to each example, the conjugate loss is a convenient way to derive a dual
problem.


7.4 Further Reading 


Continuous optimization is an active area of research, and we do not try 
to provide a comprehensive account of recent advances. 

From a gradient descent perspective, there are two major weaknesses 
which each have their own set of literature. The first challenge is the fact 
that gradient descent is a first-order algorithm, and does not use infor- 
mation about the curvature of the surface. When there are long valleys, 
the gradient points perpendicularly to the direction of interest. The idea 
of momentum can be generalized to a general class of acceleration meth- 
ods (Nesterov, 2018). Conjugate gradient methods avoid the issues faced 
by gradient descent by taking previous directions into account (Shewchuk, 
1994). Second-order methods such as Newton methods use the Hessian to 
provide information about the curvature. Many of the choices for choos- 
ing step-sizes and ideas like momentum arise by considering the curvature 
of the objective function (Goh, 2017; Bottou et al., 2018). Quasi-Newton 
methods such as L-BFGS try to use cheaper computational methods to ap- 
proximate the Hessian (Nocedal and Wright, 2006). Recently there has 
been interest in other metrics for computing descent directions, result- 
ing in approaches such as mirror descent (Beck and Teboulle, 2003) and 
natural gradient (Toussaint, 2012). 

The second challenge is to handle non-differentiable functions. Gradi- 
ent methods are not well defined when there are kinks in the function. 
In these cases, subgradient methods can be used (Shor, 1985). For fur- 
ther information and algorithms for optimizing non-differentiable func- 
tions, we refer to the book by Bertsekas (1999). There is a vast amount 
of literature on different approaches for numerically solving continuous 
optimization problems, including algorithms for constrained optimization 
problems. Good starting points to appreciate this literature are the books 
by Luenberger (1969) and Bonnans et al. (2006). A recent survey of con- 
tinuous optimization is provided by Bubeck (2015). 

Modern applications of machine learning often mean that the size of
datasets prohibits the use of batch gradient descent, and hence stochastic
gradient descent is the current workhorse of large-scale machine learning
methods. Recent surveys of the literature include Hazan (2015) and
Bottou et al. (2018).

For duality and convex optimization, the book by Boyd and Vandenberghe
(2004) includes lectures and slides online. A more mathematical
treatment is provided by Bertsekas (2009), and a recent book by one of
the key researchers in the area of optimization is Nesterov (2018).
Convex optimization is based upon convex analysis, and the reader interested
in more foundational results about convex functions is referred to
Rockafellar (1970), Hiriart-Urruty and Lemaréchal (2001), and Borwein and
Lewis (2006). Legendre-Fenchel transforms are also covered in the
aforementioned books on convex analysis, but a more beginner-friendly
presentation is available at Zia et al. (2009). The role of Legendre-Fenchel
transforms in the analysis of convex optimization algorithms is surveyed
in Polyak (2016).


Exercises


7.1 Consider the univariate function

$$f(x) = x^3 + 6x^2 - 3x - 5.$$

Find its stationary points and indicate whether they are maximum,
minimum, or saddle points.

7.2 Consider the update equation for stochastic gradient descent (Equation (7.15)).
Write down the update when we use a mini-batch size of one.

7.3 Consider whether the following statements are true or false:

a. The intersection of any two convex sets is convex.
b. The union of any two convex sets is convex.
c. The difference of a convex set A from another convex set B is convex.

7.4 Consider whether the following statements are true or false:

a. The sum of any two convex functions is convex.
b. The difference of any two convex functions is convex.
c. The product of any two convex functions is convex.
d. The maximum of any two convex functions is convex.

7.5 Express the following optimization problem as a standard linear program in
matrix notation

$$\max_{x \in \mathbb{R}^2, \; \xi \in \mathbb{R}} \; p^\top x + \xi$$

subject to the constraints that $\xi \geq 0$, $x_0 \leq 0$ and $x_1 \leq 3$.

7.6 Consider the linear program illustrated in Figure 7.9,

$$\begin{aligned} \min_{x \in \mathbb{R}^2} \quad & -\begin{bmatrix} 5 \\ 3 \end{bmatrix}^\top \begin{bmatrix} x_1 \\ x_2 \end{bmatrix} \\ \text{subject to} \quad & \begin{bmatrix} 2 & 2 \\ 2 & -4 \\ -2 & 1 \\ 0 & -1 \\ 0 & 1 \end{bmatrix} \begin{bmatrix} x_1 \\ x_2 \end{bmatrix} \leq \begin{bmatrix} 33 \\ 8 \\ 5 \\ -1 \\ 8 \end{bmatrix}. \end{aligned}$$

Derive the dual linear program using Lagrange duality.
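7.7 Consider the quadratic program illustrated in Figure 7.4,

$$\begin{aligned} \min_{x \in \mathbb{R}^2} \quad & \frac{1}{2} \begin{bmatrix} x_1 \\ x_2 \end{bmatrix}^\top \begin{bmatrix} 2 & 1 \\ 1 & 4 \end{bmatrix} \begin{bmatrix} x_1 \\ x_2 \end{bmatrix} + \begin{bmatrix} 5 \\ 3 \end{bmatrix}^\top \begin{bmatrix} x_1 \\ x_2 \end{bmatrix} \\ \text{subject to} \quad & \begin{bmatrix} 1 & 0 \\ -1 & 0 \\ 0 & 1 \\ 0 & -1 \end{bmatrix} \begin{bmatrix} x_1 \\ x_2 \end{bmatrix} \leq \begin{bmatrix} 1 \\ 1 \\ 1 \\ 1 \end{bmatrix}. \end{aligned}$$

Derive the dual quadratic program using Lagrange duality.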

7.8 Consider the following convex optimization problem

$$\begin{aligned} \min_{w \in \mathbb{R}^D} \quad & \frac{1}{2} w^\top w \\ \text{subject to} \quad & w^\top x \geq 1. \end{aligned}$$

Derive the Lagrangian dual by introducing the Lagrange multiplier $\lambda$.

7.9 Consider the negative entropy of $x \in \mathbb{R}^D$,

$$f(x) = \sum_{d=1}^{D} x_d \log x_d.$$

Derive the convex conjugate function $f^*(s)$, by assuming the standard dot
product.
Hint: Take the gradient of an appropriate function and set the gradient to zero.

7.10 Consider the function

$$f(x) = \frac{1}{2} x^\top A x + b^\top x + c,$$

where $A$ is strictly positive definite, which means that it is invertible. Derive
the convex conjugate of $f(x)$.
Hint: Take the gradient of an appropriate function and set the gradient to zero.

7.11 The hinge loss (which is the loss used by the support vector machine) is
given by

$$L(\alpha) = \max\{0, 1 - \alpha\}.$$

If we are interested in applying gradient methods such as L-BFGS, and do
not want to resort to subgradient methods, we need to smooth the kink in
the hinge loss. Compute the convex conjugate of the hinge loss $L^*(\beta)$ where
$\beta$ is the dual variable. Add a $\ell_2$ proximal term, and compute the conjugate
of the resulting function

$$L^*(\beta) + \frac{\gamma}{2} \beta^2,$$

where $\gamma$ is a given hyperparameter.
When Models Meet Data


In the first part of the book, we introduced the mathematics that form 
the foundations of many machine learning methods. The hope is that a 
reader would be able to learn the rudimentary forms of the language of 
mathematics from the first part, which we will now use to describe and 
discuss machine learning. The second part of the book introduces four 
pillars of machine learning: 


• Regression (Chapter 9)
• Dimensionality reduction (Chapter 10)
• Density estimation (Chapter 11)
• Classification (Chapter 12)


The main aim of this part of the book is to illustrate how the mathematical 
concepts introduced in the first part of the book can be used to design 
machine learning algorithms that can be used to solve tasks within the 
remit of the four pillars. We do not intend to introduce advanced machine 
learning concepts, but instead to provide a set of practical methods that 
allow the reader to apply the knowledge they gained from the first part 
of the book. It also provides a gateway to the wider machine learning 
literature for readers already familiar with the mathematics. 


8.1 Data, Models, and Learning 


It is worth pausing at this point to consider the problem that a machine
learning algorithm is designed to solve. As discussed in Chapter 1,
there are three major components of a machine learning system: data, 
models, and learning. The main question of machine learning is “What do 
we mean by good models?”. The word model has many subtleties, and we 
will revisit it multiple times in this chapter. It is also not entirely obvious 
how to objectively define the word “good”. One of the guiding principles 
of machine learning is that good models should perform well on unseen 
data. This requires us to define some performance metrics, such as accu- 
racy or distance from ground truth, as well as figuring out ways to do well 
under these performance metrics. This chapter covers a few necessary bits
and pieces of mathematical and statistical language that are commonly
used to talk about machine learning models. By doing so, we briefly
outline the current best practices for training a model such that the resulting
predictor does well on data that we have not yet seen.

Table 8.1 Example data from a fictitious human resource database that is not in a
numerical format.

Name       Gender  Degree  Postcode  Age  Annual salary
Aditya     M       MSc     W21BG     36   89563
Bob        M       PhD     EC1A1BA   47   123543
Chloé      F       BEcon   SW1A1BH   26   23989
Daisuke    M       BSc     SE207AT   68   138769
Elisabeth  F       MBA     SE10AA    33   113888

Data is assumed to be in a tidy format (Wickham, 2014; Codd, 1990).

As mentioned in Chapter 1, there are two different senses in which we 
use the phrase “machine learning algorithm”: training and prediction. We 
will describe these ideas in this chapter, as well as the idea of selecting 
among different models. We will introduce the framework of empirical 
risk minimization in Section 8.2, the principle of maximum likelihood in 
Section 8.3, and the idea of probabilistic models in Section 8.4. We briefly 
outline a graphical language for specifying probabilistic models in Sec- 
tion 8.5 and finally discuss model selection in Section 8.6. The rest of this 
section expands upon the three main components of machine learning: 
data, models and learning. 


8.1.1 Data as Vectors 


We assume that our data can be read by a computer, and represented ade- 
quately in a numerical format. Data is assumed to be tabular (Figure 8.1), 
where we think of each row of the table as representing a particular in- 
stance or example, and each column to be a particular feature. In recent 
years, machine learning has been applied to many types of data that do not 
obviously come in the tabular numerical format, for example genomic se- 
quences, text and image contents of a webpage, and social media graphs. 
We do not discuss the important and challenging aspects of identifying 
good features. Many of these aspects depend on domain expertise and re- 
quire careful engineering, and, in recent years, they have been put under 
the umbrella of data science (Stray, 2016; Adhikari and DeNero, 2018). 
Even when we have data in tabular format, there are still choices to be 
made to obtain a numerical representation. For example, in Table 8.1, the 
gender column (a categorical variable) may be converted into numbers 0 
representing “Male” and 1 representing “Female”. Alternatively, the
gender could be represented by numbers $-1$, $+1$, respectively (as shown in
Table 8.2). Furthermore, it is often important to use domain knowledge 
when constructing the representation, such as knowing that university 
degrees progress from bachelor’s to master’s to PhD or realizing that the 
postcode provided is not just a string of characters but actually encodes 
an area in London. In Table 8.2, we converted the data from Table 8.1
to a numerical format, and each postcode is represented as two numbers,
a latitude and longitude.

GenderID  Degree  Latitude      Longitude     Age  Annual Salary
                  (in degrees)  (in degrees)       (in thousands)
-1        2       51.5073       0.1290        36   89.563
-1        3       51.5074       0.1275        47   123.543
+1        1       51.5071       0.1278        26   23.989
-1        1       51.5075       0.1281        68   138.769
+1        2       51.5074       0.1278        33   113.888

Even numerical data that could potentially be
directly read into a machine learning algorithm should be carefully con- 
sidered for units, scaling, and constraints. Without additional information, 
one should shift and scale all columns of the dataset such that they have 
an empirical mean of 0 and an empirical variance of 1. For the purposes 
of this book, we assume that a domain expert already converted data
appropriately, i.e., each input $x_n$ is a $D$-dimensional vector of real numbers,
which are called features, attributes, or covariates. We consider a dataset to 
be of the form as illustrated by Table 8.2. Observe that we have dropped 
the Name column of Table 8.1 in the new numerical representation. There 
are two main reasons why this is desirable: (1) we do not expect the iden- 
tifier (the Name) to be informative for a machine learning task; and (2) 
we may wish to anonymize the data to help protect the privacy of the 
employees. 
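The conversion from Table 8.1 to a numerical representation, together with the shift-and-scale step, can be sketched as follows. The encodings mirror Table 8.2 (gender as $-1$/$+1$, degree as an ordinal level), the Name column is dropped, and the postcode-to-coordinate lookup is omitted. The variable and function names are illustrative assumptions, and the population standard deviation is used so that the empirical variance is exactly 1.

```python
import statistics

# Rows of Table 8.1: (name, gender, degree, age, annual salary).
rows = [
    ("Aditya", "M", "MSc", 36, 89563),
    ("Bob", "M", "PhD", 47, 123543),
    ("Chloé", "F", "BEcon", 26, 23989),
    ("Daisuke", "M", "BSc", 68, 138769),
    ("Elisabeth", "F", "MBA", 33, 113888),
]

# Encodings consistent with Table 8.2.
gender_code = {"M": -1, "F": +1}
degree_code = {"BSc": 1, "BEcon": 1, "MSc": 2, "MBA": 2, "PhD": 3}

def standardize(column):
    """Shift and scale a column to empirical mean 0 and empirical variance 1."""
    mean = statistics.mean(column)
    std = statistics.pstdev(column)  # population standard deviation
    return [(value - mean) / std for value in column]

genders = [gender_code[r[1]] for r in rows]
degrees = [degree_code[r[2]] for r in rows]
ages_std = standardize([float(r[3]) for r in rows])
salaries = [r[4] / 1000.0 for r in rows]  # in thousands, as in Table 8.2

# One feature vector per example; the name is deliberately left out.
features = [list(x) for x in zip(genders, degrees, ages_std)]
for row in features:
    print(row)
```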

In this part of the book, we will use N to denote the number of exam- 
ples in a dataset and index the examples with lowercase n = 1,...,N. 
We assume that we are given a set of numerical data, represented as an 
array of vectors (Table 8.2). Each row is a particular individual $x_n$, often
referred to as an example or data point in machine learning. The subscript 
n refers to the fact that this is the nth example out of a total of N exam- 
ples in the dataset. Each column represents a particular feature of interest 
about the example, and we index the features as d = 1,..., D. Recall that 
data is represented as vectors, which means that each example (each data 
point) is a D-dimensional vector. The orientation of the table originates 
from the database community, but for some machine learning algorithms 
(e.g., in Chapter 10) it is more convenient to represent examples as col- 
umn vectors. 

Let us consider the problem of predicting annual salary from age, based
on the data in Table 8.2. This is called a supervised learning problem
where we have a label $y_n$ (the salary) associated with each example $x_n$
(the age). The label $y_n$ has various other names, including target,
response variable, and annotation. A dataset is written as a set of
example-label pairs $\{(x_1, y_1), \dots, (x_n, y_n), \dots, (x_N, y_N)\}$. The table of examples
$\{x_1, \dots, x_N\}$ is often concatenated, and written as $X \in \mathbb{R}^{N \times D}$.
Figure 8.1 illustrates the dataset consisting of the two rightmost columns
of Table 8.2, where $x$ = age and $y$ = salary.

Table 8.2 Example data from a fictitious human resource database (see Table 8.1),
converted to a numerical format.

Figure 8.1 Toy data for linear regression. Training data in $(x_n, y_n)$ pairs from
the rightmost two columns of Table 8.2. We are interested in the salary of a person
aged sixty ($x = 60$) illustrated as a vertical dashed red line, which is not part of
the training data.

We use the concepts introduced in the first part of the book to formalize
the machine learning problems such as that in the previous paragraph.
Representing data as vectors $x_n$ allows us to use concepts from linear
algebra (introduced in Chapter 2). In many machine learning algorithms,
we need to additionally be able to compare two vectors. As we will see in 
Chapters 9 and 12, computing the similarity or distance between two ex- 
amples allows us to formalize the intuition that examples with similar fea- 
tures should have similar labels. The comparison of two vectors requires 
that we construct a geometry (explained in Chapter 3) and allows us to 
optimize the resulting learning problem using techniques from Chapter 7. 

Since we have vector representations of data, we can manipulate data to 
find potentially better representations of it. We will discuss finding good 
representations in two ways: finding lower-dimensional approximations 
of the original feature vector, and using nonlinear higher-dimensional 
combinations of the original feature vector. In Chapter 10, we will see an 
example of finding a low-dimensional approximation of the original data 
space by finding the principal components. Finding principal components 
is closely related to concepts of eigenvalue and singular value decomposi- 
tion as introduced in Chapter 4. For the high-dimensional representation, 
we will see an explicit feature map $\phi(\cdot)$ that allows us to represent
inputs $x_n$ using a higher-dimensional representation $\phi(x_n)$. The main
motivation for higher-dimensional representations is that we can construct
new features as non-linear combinations of the original features, which in 
turn may make the learning problem easier. We will discuss the feature 
map in Section 9.2 and show how this feature map leads to a kernel in 
Section 12.4. In recent years, deep learning methods (Goodfellow et al., 
2016) have shown promise in using the data itself to learn new good fea- 
tures and have been very successful in areas, such as computer vision, 
speech recognition, and natural language processing. We will not cover 
neural networks in this part of the book, but the reader is referred to 
Section 5.6 for the mathematical description of backpropagation, a key
concept for training neural networks. 


8.1.2 Models as Functions 


Once we have data in an appropriate vector representation, we can get to 
the business of constructing a predictive function (known as a predictor). 
In Chapter 1, we did not yet have the language to be precise about models. 
Using the concepts from the first part of the book, we can now introduce 
what “model” means. We present two major approaches in this book: a 
predictor as a function, and a predictor as a probabilistic model. We de- 
scribe the former here and the latter in the next subsection. 

A predictor is a function that, when given a particular input example 
(in our case, a vector of features), produces an output. For now, consider 
the output to be a single number, i.e., a real-valued scalar output. This can 
be written as 


$$f : \mathbb{R}^D \to \mathbb{R}, \tag{8.1}$$


where the input vector $x$ is $D$-dimensional (has $D$ features), and the
function $f$ then applied to it (written as $f(x)$) returns a real number.
Figure 8.2 illustrates a possible function that can be used to compute the
value of the prediction for input values $x$.

In this book, we do not consider the general case of all functions, which 
would involve the need for functional analysis. Instead, we consider the 
special case of linear functions 


$$f(x) = \theta^\top x + \theta_0 \tag{8.2}$$


for unknown $\theta$ and $\theta_0$. This restriction means that the contents of
Chapters 2 and 3 suffice for precisely stating the notion of a predictor for
the non-probabilistic (in contrast to the probabilistic view described next)
view of machine learning. Linear functions strike a good balance between
the generality of the problems that can be solved and the amount of
background mathematics that is needed.

Figure 8.2 Example function (black solid diagonal line) and its prediction at
$x = 60$, i.e., $f(60) = 100$.

Figure 8.3 Example function (black solid diagonal line) and its predictive
uncertainty at ...


8.1.3 Models as Probability Distributions 


We often consider data to be noisy observations of some true underlying 
effect, and hope that by applying machine learning we can identify the 
signal from the noise. This requires us to have a language for quantify- 
ing the effect of noise. We often would also like to have predictors that 
express some sort of uncertainty, e.g., to quantify the confidence we have 
about the value of the prediction for a particular test data point. As we 
have seen in Chapter 6, probability theory provides a language for quan- 
tifying uncertainty. Figure 8.3 illustrates the predictive uncertainty of the 
function as a Gaussian distribution. 

Instead of considering a predictor as a single function, we could con- 
sider predictors to be probabilistic models, i.e., models describing the dis- 
tribution of possible functions. We limit ourselves in this book to the spe- 
cial case of distributions with finite-dimensional parameters, which allows 
us to describe probabilistic models without needing stochastic processes 
and random measures. For this special case, we can think about prob- 
abilistic models as multivariate probability distributions, which already 
allow for a rich class of models. 

We will introduce how to use concepts from probability (Chapter 6) to 
define machine learning models in Section 8.4, and introduce a graphical 
language for describing probabilistic models in a compact way in
Section 8.5.


8.1.4 Learning is Finding Parameters


The goal of learning is to find a model and its corresponding parame- 
ters such that the resulting predictor will perform well on unseen data. 
There are conceptually three distinct algorithmic phases when discussing 
machine learning algorithms: 


1. Prediction or inference 
2. Training or parameter estimation 
3. Hyperparameter tuning or model selection 


The prediction phase is when we use a trained predictor on previously
unseen test data. In other words, the parameters and model choice are
already fixed and the predictor is applied to new vectors representing
new input
data points. As outlined in Chapter 1 and the previous subsection, we will 
consider two schools of machine learning in this book, corresponding to 
whether the predictor is a function or a probabilistic model. When we 
have a probabilistic model (discussed further in Section 8.4) the predic- 
tion phase is called inference. 


Remark. Unfortunately, there is no agreed-upon naming for the different
algorithmic phases. The word “inference” is sometimes also used to mean
parameter estimation of a probabilistic model, and less often may also be
used to mean prediction for non-probabilistic models. ♢


The training or parameter estimation phase is when we adjust our pre- 
dictive model based on training data. We would like to find good predic- 
tors given training data, and there are two main strategies for doing so: 
finding the best predictor based on some measure of quality (sometimes 
called finding a point estimate), or using Bayesian inference. Finding a 
point estimate can be applied to both types of predictors, but Bayesian 
inference requires probabilistic models. 

For the non-probabilistic model, we follow the principle of empirical risk 
minimization, which we describe in Section 8.2. Empirical risk minimiza- 
tion directly provides an optimization problem for finding good parame- 
ters. With a statistical model, the principle of maximum likelihood is used 
to find a good set of parameters (Section 8.3). We can additionally model 
the uncertainty of parameters using a probabilistic model, which we will 
look at in more detail in Section 8.4. 

We use numerical methods to find good parameters that “fit” the data, 
and most training methods can be thought of as hill-climbing approaches 
to find the maximum of an objective, for example the maximum of a likeli- 
hood. To apply hill-climbing approaches we use the gradients described in 
Chapter 5 and implement numerical optimization approaches from Chap- 
ter 7. 
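
The idea of following gradients to fit parameters can be sketched in a few
lines. The example below is only an illustration under stated assumptions:
the data are synthetic, the objective is the average squared error of a
linear predictor, and the step size and iteration count are hand-picked
rather than tuned. Because the objective is minimized, the update moves
against the gradient; maximizing a likelihood by hill climbing is the same
procedure applied to the negated objective.

```python
import numpy as np

rng = np.random.default_rng(0)
N, D = 100, 3
X = rng.normal(size=(N, D))                  # synthetic examples (assumption)
theta_true = np.array([1.0, -2.0, 0.5])      # hypothetical generating parameters
y = X @ theta_true + 0.1 * rng.normal(size=N)

def objective(theta):
    """Average squared error of the linear predictor on the training data."""
    return np.mean((y - X @ theta) ** 2)

def gradient(theta):
    """Gradient of the objective with respect to theta."""
    return -2.0 / N * X.T @ (y - X @ theta)

theta = np.zeros(D)
step_size = 0.1                               # hand-picked, not tuned
for _ in range(500):
    theta = theta - step_size * gradient(theta)   # step against the gradient

print(theta, objective(theta))
```

For this well-conditioned problem the iterates should settle close to the
least-squares solution; for harder objectives the step size usually needs
more care (Chapter 7).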

As mentioned in Chapter 1, we are interested in learning a model based 
on data such that it performs well on future data. It is not enough for
the model to only fit the training data well; the predictor needs to
perform well on unseen data. We simulate the behavior of our predictor on
future unseen data using cross-validation (Section 8.2.4). As we will see 
in this chapter, to achieve the goal of performing well on unseen data, 
we will need to balance between fitting well on training data and finding 
“simple” explanations of the phenomenon. This trade-off is achieved us- 
ing regularization (Section 8.2.3) or by adding a prior (Section 8.3.2). In 
philosophy, this is considered to be neither induction nor deduction, but 
is called abduction. According to the Stanford Encyclopedia of Philosophy, 
abduction is the process of inference to the best explanation (Douven, 
2017). 

We often need to make high-level modeling decisions about the struc- 
ture of the predictor, such as the number of components to use or the 
class of probability distributions to consider. The choice of the number of 
components is an example of a hyperparameter, and this choice can af- 
fect the performance of the model significantly. The problem of choosing 
among different models is called model selection, which we describe in 
Section 8.6. For non-probabilistic models, model selection is often done 
using nested cross-validation, which is described in Section 8.6.1. We also 
use model selection to choose hyperparameters of our model. 


Remark. The distinction between parameters and hyperparameters is some- 
what arbitrary, and is mostly driven by the distinction between what can 
be numerically optimized versus what needs to use search techniques. 
Another way to consider the distinction is to consider parameters as the 
explicit parameters of a probabilistic model, and to consider hyperparam- 
eters (higher-level parameters) as parameters that control the distribution 
of these explicit parameters. ♢


In the following sections, we will look at three flavors of machine learn- 
ing: empirical risk minimization (Section 8.2), the principle of maximum 
likelihood (Section 8.3), and probabilistic modeling (Section 8.4). 


8.2 Empirical Risk Minimization 


After having all the mathematics under our belt, we are now in a posi- 
tion to introduce what it means to learn. The “learning” part of machine 
learning boils down to estimating parameters based on training data. 

In this section, we consider the case of a predictor that is a function, 
and consider the case of probabilistic models in Section 8.3. We describe 
the idea of empirical risk minimization, which was originally popularized 
by the proposal of the support vector machine (described in Chapter 12). 
However, its general principles are widely applicable and allow us to ask 
the question of what is learning without explicitly constructing probabilis- 
tic models. There are four main design choices, which we will cover in 
detail in the following subsections:

Section 8.2.1 What is the set of functions we allow the predictor to take?

Section 8.2.2 How do we measure how well the predictor performs on
the training data?

Section 8.2.3 How do we construct predictors from only training data
that perform well on unseen test data?

Section 8.2.4 What is the procedure for searching over the space of
models?


8.2.1 Hypothesis Class of Functions 


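Assume we are given $N$ examples $\boldsymbol{x}_n \in \mathbb{R}^D$ and
corresponding scalar labels $y_n \in \mathbb{R}$. We consider the
supervised learning setting, where we obtain pairs
$(\boldsymbol{x}_1, y_1), \ldots, (\boldsymbol{x}_N, y_N)$. Given this
data, we would like to estimate a predictor
$f(\cdot, \boldsymbol{\theta}) : \mathbb{R}^D \to \mathbb{R}$,
parametrized by $\boldsymbol{\theta}$. We hope to be able to find a good
parameter $\boldsymbol{\theta}^*$ such that we fit the data well, that is,


f(\boldsymbol{x}_n, \boldsymbol{\theta}^*) \approx y_n \quad \text{for all} \quad n = 1, \ldots, N .    (8.3)


In this section, we use the notation
$\hat{y}_n = f(\boldsymbol{x}_n, \boldsymbol{\theta}^*)$ to represent the
output of the predictor.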


Remark. For ease of presentation, we will describe empirical risk mini- 
mization in terms of supervised learning (where we have labels). This 
simplifies the definition of the hypothesis class and the loss function. It 
is also common in machine learning to choose a parametrized class of 
functions, for example affine functions. ♢


Example 8.1 

We introduce the problem of ordinary least-squares regression to illustrate 
empirical risk minimization. A more comprehensive account of regression 
is given in Chapter 9. When the label $y_n$ is real-valued, a popular
choice of function class for predictors is the set of affine functions. We
choose a more compact notation for an affine function by concatenating an
additional unit feature $x^{(0)} = 1$ to $\boldsymbol{x}_n$, i.e.,
$\boldsymbol{x}_n = [1, x_n^{(1)}, x_n^{(2)}, \ldots, x_n^{(D)}]^\top$. The
parameter vector is correspondingly
$\boldsymbol{\theta} = [\theta_0, \theta_1, \theta_2, \ldots, \theta_D]^\top$,
allowing us to write the predictor as a linear function

f(\boldsymbol{x}_n, \boldsymbol{\theta}) = \boldsymbol{\theta}^\top \boldsymbol{x}_n .    (8.4)

This linear predictor is equivalent to the affine model

f(\boldsymbol{x}_n, \boldsymbol{\theta}) = \theta_0 + \sum_{d=1}^{D} \theta_d x_n^{(d)} .    (8.5)

The predictor takes the vector of features representing a single example
$\boldsymbol{x}_n$ as input and produces a real-valued output, i.e.,
$f : \mathbb{R}^{D+1} \to \mathbb{R}$. The previous figures in this chapter
had a straight line as a predictor, which means that we have assumed an
affine function.

Instead of a linear function, we may wish to consider non-linear func- 
tions as predictors. Recent advances in neural networks allow for efficient 
computation of more complex non-linear function classes. 
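
The equivalence between the linear form (8.4) on the augmented example and
the affine form (8.5) can be checked numerically. The sketch below is only
an illustration: the function names and the numbers in `x` and `theta` are
made up for this purpose and do not come from the text.

```python
import numpy as np

def augment(x):
    """Prepend the unit feature x^(0) = 1 to a D-dimensional example."""
    return np.concatenate(([1.0], x))

def predict_linear(x_aug, theta):
    """Linear predictor (8.4), applied to the augmented example."""
    return theta @ x_aug

def predict_affine(x, theta):
    """Affine predictor (8.5): intercept theta[0] plus weighted features."""
    return theta[0] + np.sum(theta[1:] * x)

x = np.array([2.0, -1.0, 0.5])            # one example with D = 3 (made up)
theta = np.array([0.5, 1.0, 2.0, -3.0])   # [theta_0, ..., theta_D] (made up)

print(predict_linear(augment(x), theta), predict_affine(x, theta))
```

Both calls evaluate the same expression, so the two printed values agree.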


Given the class of functions, we want to search for a good predictor. 
We now move on to the second ingredient of empirical risk minimization: 
how to measure how well the predictor fits the training data. 


8.2.2 Loss Function for Training 


Consider the label $y_n$ for a particular example, and the corresponding
prediction $\hat{y}_n$ that we make based on $\boldsymbol{x}_n$. To define
what it means to fit the data well, we need to specify a loss function
$\ell(y_n, \hat{y}_n)$ that takes the ground truth label and the
prediction as input and produces a non-negative number (referred to as the
loss) representing how much error we have made on this particular
prediction. Our goal for finding a good parameter vector
$\boldsymbol{\theta}^*$ is to minimize the average loss on the set of $N$
training examples.

One assumption that is commonly made in machine learning is that the set
of examples $(\boldsymbol{x}_1, y_1), \ldots, (\boldsymbol{x}_N, y_N)$ is
independent and identically distributed. The word independent
(Section 6.4.5) means that two data points $(\boldsymbol{x}_i, y_i)$ and
$(\boldsymbol{x}_j, y_j)$ do not statistically depend on each other,
meaning that the empirical mean is a good estimate of the population mean
(Section 6.4.1). This implies that we can use the empirical mean of the
loss on the training data. For a given training set
$\{(\boldsymbol{x}_1, y_1), \ldots, (\boldsymbol{x}_N, y_N)\}$, we
introduce the notation of an example matrix
$\boldsymbol{X} := [\boldsymbol{x}_1, \ldots, \boldsymbol{x}_N]^\top \in \mathbb{R}^{N \times D}$
and a label vector
$\boldsymbol{y} := [y_1, \ldots, y_N]^\top \in \mathbb{R}^N$. Using this
matrix notation the average loss is given by


R_{\mathrm{emp}}(f, \boldsymbol{X}, \boldsymbol{y}) = \frac{1}{N} \sum_{n=1}^{N} \ell(y_n, \hat{y}_n) ,    (8.6)


where $\hat{y}_n = f(\boldsymbol{x}_n, \boldsymbol{\theta})$. Equation
(8.6) is called the empirical risk and depends on three arguments, the
predictor $f$ and the data $\boldsymbol{X}, \boldsymbol{y}$. This general
strategy for learning is called empirical risk minimization.
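
The empirical risk (8.6) translates almost directly into code. The sketch
below assumes the predictor and the loss are passed in as plain Python
callables; the tiny dataset, the parameter vector, and the choice of the
squared loss are illustrative assumptions only.

```python
import numpy as np

def empirical_risk(f, X, y, loss):
    """Average loss (8.6) of predictor f over the rows of X and labels y."""
    predictions = [f(x_n) for x_n in X]
    return np.mean([loss(y_n, y_hat) for y_n, y_hat in zip(y, predictions)])

def squared_loss(y_n, y_hat):
    """One possible loss function; non-negative by construction."""
    return (y_n - y_hat) ** 2

# Three examples with a unit feature in the first column (made-up numbers).
X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0]])
y = np.array([1.0, 3.0, 5.0])
theta = np.array([1.0, 1.5])              # hypothetical parameter vector

def f(x_n):
    """Linear predictor with the fixed parameters above."""
    return theta @ x_n

risk = empirical_risk(f, X, y, squared_loss)
print(risk)
```

Here the predictions are 1.0, 2.5, and 4.0, so the squared errors are 0,
0.25, and 1, and the empirical risk is their average, 1.25/3.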


Example 8.2 (Least-Squares Loss) 

Continuing the example of least-squares regression, we specify that we 
measure the cost of making an error during training using the squared 
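loss $\ell(y_n, \hat{y}_n) = (y_n - \hat{y}_n)^2$. We wish to minimize the
empirical risk (8.6), which is the average of the losses over the data


\min_{\boldsymbol{\theta} \in \mathbb{R}^D} \frac{1}{N} \sum_{n=1}^{N} (y_n - f(\boldsymbol{x}_n, \boldsymbol{\theta}))^2 ,    (8.7)


where we substituted the predictor
$\hat{y}_n = f(\boldsymbol{x}_n, \boldsymbol{\theta})$. By using our choice
of a linear predictor
$f(\boldsymbol{x}_n, \boldsymbol{\theta}) = \boldsymbol{\theta}^\top \boldsymbol{x}_n$,
we obtain the optimization problem


\min_{\boldsymbol{\theta} \in \mathbb{R}^D} \frac{1}{N} \sum_{n=1}^{N} (y_n - \boldsymbol{\theta}^\top \boldsymbol{x}_n)^2 .    (8.8)


This equation can be equivalently expressed in matrix form


\min_{\boldsymbol{\theta} \in \mathbb{R}^D} \frac{1}{N} \| \boldsymbol{y} - \boldsymbol{X} \boldsymbol{\theta} \|^2 .    (8.9)


This is known as the least-squares problem. There exists a closed-form
analytic solution for this by solving the normal equations, which we will
discuss in Section 9.2.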
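
As a preview of the closed-form route, the least-squares problem (8.9) can
be solved numerically by setting the gradient to zero, which gives the
normal equations $\boldsymbol{X}^\top \boldsymbol{X} \boldsymbol{\theta} = \boldsymbol{X}^\top \boldsymbol{y}$.
The sketch below uses synthetic data (an assumption) and compares this with
NumPy's built-in least-squares solver; it is an illustration, not the
derivation given in Section 9.2.

```python
import numpy as np

rng = np.random.default_rng(0)
N, D = 50, 3
X = rng.normal(size=(N, D))                 # synthetic example matrix
theta_true = np.array([2.0, -1.0, 0.5])     # hypothetical generating parameters
y = X @ theta_true + 0.1 * rng.normal(size=N)

# Normal equations: X^T X theta = X^T y (assumes X^T X is invertible).
theta_normal = np.linalg.solve(X.T @ X, X.T @ y)

# Library solver for min ||y - X theta||^2, numerically more robust.
theta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

# The gradient of (8.9) vanishes at the minimizer.
grad = -2.0 / N * X.T @ (y - X @ theta_normal)
print(theta_normal, theta_lstsq, np.linalg.norm(grad))
```

Solving the normal equations directly is fine for small, well-conditioned
problems; `np.linalg.lstsq` is the safer default when the columns of
$\boldsymbol{X}$ are nearly collinear.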


We are not interested in a predictor that only performs well on the 
training data. Instead, we seek a predictor that performs well (has low 
risk) on unseen test data. More formally, we are interested in finding a 
predictor f (with parameters fixed) that minimizes the expected risk 


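R_{\mathrm{true}}(f) = \mathbb{E}_{\boldsymbol{x}, y}[\ell(y, f(\boldsymbol{x}))] ,    (8.10)


where $y$ is the label and $f(\boldsymbol{x})$ is the prediction based on
the example $\boldsymbol{x}$. The notation $R_{\mathrm{true}}(f)$ indicates
that this is the true risk if we had access to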
an infinite amount of data. The expectation is over the (infinite) set of all 
possible data and labels. There are two practical questions that arise from 
our desire to minimize expected risk, which we address in the following 
two subsections: 


= How should we change our training procedure to generalize well? 


= How do we estimate expected risk from (finite) data? 


Remark. Many machine learning tasks are specified with an associated 
performance measure, e.g., accuracy of prediction or root mean squared 
error. The performance measure could be more complex, be cost sensitive, 
and capture details about the particular application. In principle, the de- 
sign of the loss function for empirical risk minimization should correspond 
directly to the performance measure specified by the machine learning 
task. In practice, there is often a mismatch between the design of the loss 
function and the performance measure. This could be due to issues such 
as ease of implementation or efficiency of optimization. ♢


8.2.3 Regularization to Reduce Overfitting


This section describes an addition to empirical risk minimization that al- 
lows it to generalize well (approximately minimizing expected risk). Re- 
call that the aim of training a machine learning predictor is so that we can 
perform well on unseen data, i.e., the predictor generalizes well. We sim- 
ulate this unseen data by holding out a proportion of the whole dataset. 
This hold out set is referred to as the test set. Given a sufficiently rich class 
of functions for the predictor f, we can essentially memorize the training 
data to obtain zero empirical risk. While this is great to minimize the loss 
(and therefore the risk) on the training data, we would not expect the 
predictor to generalize well to unseen data. In practice, we have only a 
finite set of data, and hence we split our data into a training and a test 
set. The training set is used to fit the model, and the test set (not seen 
by the machine learning algorithm during training) is used to evaluate 
generalization performance. It is important for the user to not cycle back 
to a new round of training after having observed the test set. We use the 
subscripts train and test to denote the training and test sets, respectively. 
We will revisit this idea of using a finite dataset to evaluate expected risk 
in Section 8.2.4. 


It turns out that empirical risk minimization can lead to overfitting, i.e., 
the predictor fits too closely to the training data and does not general- 
ize well to new data (Mitchell, 1997). This general phenomenon of hav- 
ing very small average loss on the training set but large average loss on 
the test set tends to occur when we have little data and a complex hy- 
pothesis class. For a particular predictor f (with parameters fixed), the 
phenomenon of overfitting occurs when the risk estimate from the training
data $R_{\mathrm{emp}}(f, \boldsymbol{X}_{\mathrm{train}}, \boldsymbol{y}_{\mathrm{train}})$
underestimates the expected risk $R_{\mathrm{true}}(f)$. Since we estimate
the expected risk $R_{\mathrm{true}}(f)$ by using the empirical risk on the
test set $R_{\mathrm{emp}}(f, \boldsymbol{X}_{\mathrm{test}}, \boldsymbol{y}_{\mathrm{test}})$,
if the test risk is much larger than the training risk, this is an
indication of overfitting. We revisit the idea of
overfitting in Section 8.3.3. 
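
A small experiment makes the gap between training and test risk concrete.
The sketch below rests on several assumptions that are not in the text: the
data come from a noisy sine curve, the hypothesis classes are polynomials
of degree 1, 3, and 9, and the risk is the average squared loss. With only
ten training points, the degree-9 polynomial can interpolate them, so one
would expect a training risk near zero together with a noticeably larger
test risk.

```python
import numpy as np

rng = np.random.default_rng(0)

def noisy_labels(x):
    """Hypothetical data-generating process: sine curve plus Gaussian noise."""
    return np.sin(np.pi * x) + 0.2 * rng.normal(size=x.shape)

x_train = np.linspace(-1.0, 1.0, 10)        # little training data
y_train = noisy_labels(x_train)
x_test = rng.uniform(-1.0, 1.0, size=200)   # held-out test set
y_test = noisy_labels(x_test)

def empirical_risk(coeffs, x, y):
    """Average squared loss of the polynomial predictor on (x, y)."""
    return np.mean((y - np.polyval(coeffs, x)) ** 2)

risks = {}
for degree in (1, 3, 9):
    coeffs = np.polyfit(x_train, y_train, deg=degree)   # least-squares fit
    risks[degree] = (empirical_risk(coeffs, x_train, y_train),
                     empirical_risk(coeffs, x_test, y_test))
    print(degree, risks[degree])
```

Each printed pair is (training risk, test risk). A test risk that is much
larger than the training risk is the indication of overfitting described
above.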


Therefore, we need to somehow bias the search for the minimizer of 
empirical risk by introducing a penalty term, which makes it harder for 
the optimizer to return an overly flexible predictor. In machine learning, 
the penalty term is referred to as regularization. Regularization is a way 
to compromise between accurate solution of empirical risk minimization 
and the size or complexity of the solution. 


Example 8.3 (Regularized Least Squares) 
Regularization is an approach that discourages complex or extreme
solutions to an optimization problem. The simplest regularization strategy
is to replace the least-squares problem


\min_{\boldsymbol{\theta}} \frac{1}{N} \| \boldsymbol{y} - \boldsymbol{X} \boldsymbol{\theta} \|^2    (8.11)


in the previous example with the “regularized” problem by adding a
penalty term involving only $\boldsymbol{\theta}$:


\min_{\boldsymbol{\theta}} \frac{1}{N} \| \boldsymbol{y} - \boldsymbol{X} \boldsymbol{\theta} \|^2 + \lambda \| \boldsymbol{\theta} \|^2 .    (8.12)


The additional term $\| \boldsymbol{\theta} \|^2$ is called the
regularizer, and the parameter $\lambda$ is the regularization parameter.
The regularization parameter trades off minimizing the loss on the
training set and the magnitude of the parameters $\boldsymbol{\theta}$. It
often happens that the magnitude of the parameter values becomes
relatively large if we run into overfitting (Bishop, 2006).
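
The effect of the regularization parameter can be seen numerically. The
sketch below assumes synthetic data and uses the fact that setting the
gradient of (8.12) to zero yields the linear system
$(\boldsymbol{X}^\top \boldsymbol{X} + N \lambda \boldsymbol{I}) \boldsymbol{\theta} = \boldsymbol{X}^\top \boldsymbol{y}$;
the function name and the grid of $\lambda$ values are illustrative
choices.

```python
import numpy as np

def regularized_least_squares(X, y, lam):
    """Minimizer of (1/N) * ||y - X theta||^2 + lam * ||theta||^2, cf. (8.12)."""
    N, D = X.shape
    # Zero gradient: (X^T X + N * lam * I) theta = X^T y.
    return np.linalg.solve(X.T @ X + N * lam * np.eye(D), X.T @ y)

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 5))                          # synthetic examples
y = X @ np.array([3.0, -2.0, 0.0, 1.0, 0.5]) + 0.5 * rng.normal(size=20)

lambdas = (0.0, 0.1, 1.0, 10.0)
norms = []
for lam in lambdas:
    theta = regularized_least_squares(X, y, lam)
    norms.append(np.linalg.norm(theta))
    print(lam, norms[-1])
```

As $\lambda$ grows, the norm of the solution shrinks toward zero, which is
the trade-off between training loss and parameter magnitude described in
the example.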


The regularization term is sometimes called the penalty term, which
biases the vector $\boldsymbol{\theta}$ to be closer to the origin. The
idea of regularization also
appears in probabilistic models as the prior probability of the parameters. 
Recall from Section 6.6 that for the posterior distribution to be of the same 
form as the prior distribution, the prior and the likelihood need to be con- 
jugate. We will revisit this idea in Section 8.3.2. We will see in Chapter 12 
that the idea of the regularizer is equivalent to the idea of a large margin. 


8.2.4 Cross-Validation to Assess the Generalization Performance 


We mentioned in the previous section that we estimate the generalization
error by applying the predictor to test data. This data is also sometimes
referred to as the validation set. The validation set is a subset of the
available training data that we keep aside. A practical issue with this
approach is that the amount of data is limited, and ideally we would use
as much of the data available to train the model. This would require us
to keep our validation set $\mathcal{V}$ small, which then would lead to a
noisy estimate (with high variance) of the predictive performance. One
solution to these contradictory objectives (large training set, large
validation set) is to use cross-validation. K-fold cross-validation
effectively partitions the data into $K$ chunks, $K - 1$ of which form the
training set $\mathcal{R}$, and the last chunk serves as the validation
set $\mathcal{V}$ (similar to the idea outlined previously).
Cross-validation iterates through (ideally) all combinations of
assignments of chunks to $\mathcal{R}$ and $\mathcal{V}$; see Figure 8.4.
This procedure is repeated for all $K$ choices for the validation set, and
the performance of the model from the $K$ runs is averaged.

We partition our dataset into two sets
$\mathcal{D} = \mathcal{R} \cup \mathcal{V}$, such that they do not overlap
($\mathcal{R} \cap \mathcal{V} = \emptyset$), where $\mathcal{V}$ is the
validation set, and train our model on $\mathcal{R}$. After training, we
assess the performance of the predictor $f$ on the validation set
$\mathcal{V}$ (e.g., by computing root mean square error (RMSE) of the
trained model on the validation set). More precisely, for each partition
$k$ the training data $\mathcal{R}^{(k)}$ produces a predictor $f^{(k)}$,
which is then applied to validation set $\mathcal{V}^{(k)}$ to compute the
empirical risk $R(f^{(k)}, \mathcal{V}^{(k)})$. We cycle through all
possible partitionings of validation and training sets and compute the
average generalization error of the predictor. Cross-validation
approximates the expected generalization error


\mathbb{E}_{\mathcal{V}}[R(f, \mathcal{V})] \approx \frac{1}{K} \sum_{k=1}^{K} R(f^{(k)}, \mathcal{V}^{(k)}) ,    (8.13)


where $R(f^{(k)}, \mathcal{V}^{(k)})$ is the risk (e.g., RMSE) on the
validation set $\mathcal{V}^{(k)}$ for predictor $f^{(k)}$. The
approximation has two sources: first, due to the finite training set,
which results in not the best possible $f^{(k)}$; and second, due to the
finite validation set, which results in an inaccurate estimation of the
risk $R(f^{(k)}, \mathcal{V}^{(k)})$. A potential disadvantage of K-fold
cross-validation is
the computational cost of training the model K times, which can be bur- 
densome if the training cost is computationally expensive. In practice, it 
is often not sufficient to look at the direct parameters alone. For example, 
we need to explore multiple complexity parameters (e.g., multiple regu- 
larization parameters), which may not be direct parameters of the model. 
Evaluating the quality of the model, depending on these hyperparameters, 
may result in a number of training runs that is exponential in the number 
of model parameters. One can use nested cross-validation (Section 8.6.1) 
to search for good hyperparameters. 

However, cross-validation is an embarrassingly parallel problem, i.e., lit- 
tle effort is needed to separate the problem into a number of parallel 
tasks. Given sufficient computing resources (e.g., cloud computing, server 
farms), cross-validation does not require longer than a single performance 
assessment. 
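
The K-fold procedure behind (8.13) can be sketched as follows. This is a
minimal illustration under stated assumptions: the helper names
(`k_fold_cross_validation`, `fit_least_squares`, `rmse`) are hypothetical,
the data are synthetic, the model is a least-squares linear predictor, and
the folds are formed by a random permutation of the indices.

```python
import numpy as np

def k_fold_cross_validation(X, y, K, fit, risk, seed=0):
    """Average validation risk over K folds, cf. (8.13)."""
    indices = np.random.default_rng(seed).permutation(X.shape[0])
    folds = np.array_split(indices, K)          # K disjoint chunks
    fold_risks = []
    for k in range(K):
        val_idx = folds[k]                      # chunk k is the validation set
        train_idx = np.concatenate([folds[j] for j in range(K) if j != k])
        predictor = fit(X[train_idx], y[train_idx])
        fold_risks.append(risk(predictor, X[val_idx], y[val_idx]))
    return np.mean(fold_risks), fold_risks

def fit_least_squares(X, y):
    """Train a linear predictor; returns a function mapping inputs to outputs."""
    theta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return lambda X_new: X_new @ theta

def rmse(predictor, X, y):
    """Root mean square error of the predictor on (X, y)."""
    return np.sqrt(np.mean((y - predictor(X)) ** 2))

rng = np.random.default_rng(1)
X = rng.normal(size=(60, 3))                    # synthetic dataset
y = X @ np.array([1.0, 2.0, -1.0]) + 0.3 * rng.normal(size=60)

cv_risk, fold_risks = k_fold_cross_validation(
    X, y, K=5, fit=fit_least_squares, risk=rmse)
print(cv_risk, fold_risks)
```

The K iterations of the loop are independent of each other, which is why
the procedure parallelizes so easily: each fold could be trained and
evaluated on a separate worker.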

In this section, we saw that empirical risk minimization is based on the 
following concepts: the hypothesis class of functions, the loss function and 
regularization. In Section 8.3, we will see the effect of using a probability 
distribution to replace the idea of loss functions and regularization.


8.2.5 Further Reading


Due to the fact that the original development of empirical risk minimiza- 
tion (Vapnik, 1998) was couched in heavily theoretical language, many 
of the subsequent developments have been theoretical. The area of study 
is called statistical learning theory (Vapnik, 1999; Evgeniou et al., 2000; 
Hastie et al., 2001; von Luxburg and Schölkopf, 2011). A recent machine
learning textbook that builds on the theoretical foundations and develops 
efficient learning algorithms is Shalev-Shwartz and Ben-David (2014). 

The concept of regularization has its roots in the solution of ill-posed in- 
verse problems (Neumaier, 1998). The approach presented here is called 
Tikhonov regularization, and there is a closely related constrained version 
called Ivanov regularization. Tikhonov regularization has deep
relationships to the bias-variance trade-off and feature selection
(Bühlmann and van de Geer, 2011). Alternatives to cross-validation are
the bootstrap and the jackknife (Efron and Tibshirani, 1993; Davidson and
Hinkley, 1997; Hall, 1992).

Thinking about empirical risk minimization (Section 8.2) as “probabil- 
ity free” is incorrect. There is an underlying unknown probability distri- 
bution p(x, y) that governs the data generation. However, the approach 
of empirical risk minimization is agnostic to that choice of distribution. 
This is in contrast to standard statistical approaches that explicitly re- 
quire the knowledge of p(x, y). Furthermore, since the distribution is a 
joint distribution on both examples $\boldsymbol{x}$ and labels $y$, the
labels can be non-deterministic. In contrast to standard statistics we do
not need to specify
the noise distribution for the labels y. 


8.3 Parameter Estimation 


In Section 8.2, we did not explicitly model our problem using probability 
distributions. In this section, we will see how to use probability distribu- 
tions to model our uncertainty due to the observation process and our 
uncertainty in the parameters of our predictors. In Section 8.3.1, we in- 
troduce the likelihood, which is analogous to the concept of loss functions 
(Section 8.2.2) in empirical risk minimization. The concept of priors (Sec- 
tion 8.3.2) is analogous to the concept of regularization (Section 8.2.3). 


8.3.1 Maximum Likelihood Estimation 


The idea behind maximum likelihood estimation (MLE) is to define a func- 
tion of the parameters that enables us to find a model that fits the data 
well. The estimation problem is focused on the likelihood function, or 
more precisely its negative logarithm. For data represented by a random
variable $\boldsymbol{x}$ and for a family of probability densities
$p(\boldsymbol{x} \mid \boldsymbol{\theta})$ parametrized by
$\boldsymbol{\theta}$, the negative log-likelihood is given by


\mathcal{L}_{\boldsymbol{x}}(\boldsymbol{\theta}) = -\log p(\boldsymbol{x} \mid \boldsymbol{\theta}) .    (8.14)


The notation $\mathcal{L}_{\boldsymbol{x}}(\boldsymbol{\theta})$
emphasizes the fact that the parameter $\boldsymbol{\theta}$ is varying
and the data $\boldsymbol{x}$ is fixed. We very often drop the reference
to $\boldsymbol{x}$ when writing the negative log-likelihood, as it is
really a function of $\boldsymbol{\theta}$, and write it as
$\mathcal{L}(\boldsymbol{\theta})$ when the random variable representing
the uncertainty in the data is clear from the context.

Let us interpret what the probability density
$p(\boldsymbol{x} \mid \boldsymbol{\theta})$ is modeling for a fixed value
of $\boldsymbol{\theta}$. It is a distribution that models the uncertainty
of the data. In other words, once we have chosen the type of function we
want as a predictor, the likelihood provides the probability of observing
data $\boldsymbol{x}$.

In a complementary view, if we consider the data to be fixed (because it
has been observed), and we vary the parameters $\boldsymbol{\theta}$, what
does $\mathcal{L}(\boldsymbol{\theta})$ tell us? It tells us how likely a
particular setting of $\boldsymbol{\theta}$ is for the observations
$\boldsymbol{x}$. Based on this second view, the maximum likelihood
estimator gives us the most likely parameter $\boldsymbol{\theta}$ for the
set of data.

We consider the supervised learning setting, where we obtain pairs
$(\boldsymbol{x}_1, y_1), \ldots, (\boldsymbol{x}_N, y_N)$ with
$\boldsymbol{x}_n \in \mathbb{R}^D$ and labels $y_n \in \mathbb{R}$. We are
interested in constructing a predictor that takes a feature vector
$\boldsymbol{x}_n$ as input and produces a prediction $y_n$ (or something
close to it), i.e., given a vector $\boldsymbol{x}_n$ we want the
probability distribution of the label $y_n$. In other words, we specify
the conditional probability distribution of the labels given the examples
for the particular parameter setting $\boldsymbol{\theta}$.


Example 8.4 

The first example that is often used is to specify that the conditional 
probability of the labels given the examples is a Gaussian distribution. In 
other words, we assume that we can explain our observation uncertainty 
by independent Gaussian noise (refer to Section 6.5) with zero mean, 
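$\varepsilon_n \sim \mathcal{N}(0, \sigma^2)$. We further assume that the
linear model $\boldsymbol{x}_n^\top \boldsymbol{\theta}$ is used for
prediction. This means we specify a Gaussian likelihood for each example
label pair $(\boldsymbol{x}_n, y_n)$,


p(y_n \mid \boldsymbol{x}_n, \boldsymbol{\theta}) = \mathcal{N}(y_n \mid \boldsymbol{x}_n^\top \boldsymbol{\theta}, \sigma^2) .    (8.15)


An illustration of a Gaussian likelihood for a given parameter
$\boldsymbol{\theta}$ is shown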
in Figure 8.3. We will see in Section 9.2 how to explicitly expand the 
preceding expression out in terms of the Gaussian distribution. 


We assume that the set of examples
$(\boldsymbol{x}_1, y_1), \ldots, (\boldsymbol{x}_N, y_N)$ is independent
and identically distributed (i.i.d.). The word “independent”
(Section 6.4.5) implies that the likelihood of the whole dataset
($\mathcal{Y} = \{y_1, \ldots, y_N\}$ and
$\mathcal{X} = \{\boldsymbol{x}_1, \ldots, \boldsymbol{x}_N\}$) factorizes
into a product of the likelihoods of each individual example


p(\mathcal{Y} \mid \mathcal{X}, \boldsymbol{\theta}) = \prod_{n=1}^{N} p(y_n \mid \boldsymbol{x}_n, \boldsymbol{\theta}) ,    (8.16)


where $p(y_n \mid \boldsymbol{x}_n, \boldsymbol{\theta})$ is a particular
distribution (which was Gaussian in Example 8.4). The expression
“identically distributed” means that each term in the product (8.16) is
of the same distribution, and all of them share the same parameters. It is
often easier from an optimization viewpoint to compute functions that can
be decomposed into sums of simpler functions. Hence, in machine learning
we often consider the negative log-likelihood


\mathcal{L}(\boldsymbol{\theta}) = -\log p(\mathcal{Y} \mid \mathcal{X}, \boldsymbol{\theta}) = -\sum_{n=1}^{N} \log p(y_n \mid \boldsymbol{x}_n, \boldsymbol{\theta}) .    (8.17)


While it is tempting to interpret the fact that $\boldsymbol{\theta}$ is
on the right of the conditioning bar in
$p(y_n \mid \boldsymbol{x}_n, \boldsymbol{\theta})$ (8.15) as meaning that
$\boldsymbol{\theta}$ is observed and fixed, this interpretation is
incorrect. The negative log-likelihood $\mathcal{L}(\boldsymbol{\theta})$
is a function of $\boldsymbol{\theta}$. Therefore, to find a good
parameter vector $\boldsymbol{\theta}$ that explains the data
$(\boldsymbol{x}_1, y_1), \ldots, (\boldsymbol{x}_N, y_N)$ well, minimize
the negative log-likelihood $\mathcal{L}(\boldsymbol{\theta})$ with
respect to $\boldsymbol{\theta}$.

Remark. The negative sign in (8.17) is a historical artifact that is due 
to the convention that we want to maximize likelihood, but numerical 
optimization literature tends to study minimization of functions. ♢


Example 8.5 
Continuing on our example of Gaussian likelihoods (8.15), the negative 
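log-likelihood can be rewritten as


\mathcal{L}(\boldsymbol{\theta}) = -\sum_{n=1}^{N} \log p(y_n \mid \boldsymbol{x}_n, \boldsymbol{\theta}) = -\sum_{n=1}^{N} \log \mathcal{N}(y_n \mid \boldsymbol{x}_n^\top \boldsymbol{\theta}, \sigma^2)    (8.18a)

= -\sum_{n=1}^{N} \log \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left( -\frac{(y_n - \boldsymbol{x}_n^\top \boldsymbol{\theta})^2}{2\sigma^2} \right)    (8.18b)

= -\sum_{n=1}^{N} \log \exp\left( -\frac{(y_n - \boldsymbol{x}_n^\top \boldsymbol{\theta})^2}{2\sigma^2} \right) - \sum_{n=1}^{N} \log \frac{1}{\sqrt{2\pi\sigma^2}}    (8.18c)

= \frac{1}{2\sigma^2} \sum_{n=1}^{N} (y_n - \boldsymbol{x}_n^\top \boldsymbol{\theta})^2 - \sum_{n=1}^{N} \log \frac{1}{\sqrt{2\pi\sigma^2}} .    (8.18d)


As $\sigma$ is given, the second term in (8.18d) is constant, and
minimizing $\mathcal{L}(\boldsymbol{\theta})$ corresponds to solving the
least-squares problem (compare with (8.8)) expressed in the first term.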
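
The correspondence between the Gaussian negative log-likelihood and least
squares can be checked numerically. The sketch below assumes synthetic data
and a known noise level `sigma`; it evaluates the negative log-likelihood
once directly from the Gaussian density and once in the form (8.18d), and
then confirms that the least-squares solution is not improved by a small
perturbation. The function names are illustrative.

```python
import numpy as np

def gaussian_nll(theta, X, y, sigma):
    """Negative log-likelihood (8.17) under the Gaussian likelihood (8.15)."""
    residuals = y - X @ theta
    log_density = (-0.5 * np.log(2 * np.pi * sigma**2)
                   - residuals**2 / (2 * sigma**2))
    return -np.sum(log_density)

def nll_as_in_8_18d(theta, X, y, sigma):
    """Same quantity as (8.18d): scaled squared error minus a constant sum."""
    N = len(y)
    squared_error = np.sum((y - X @ theta) ** 2)
    constant = N * np.log(1.0 / np.sqrt(2 * np.pi * sigma**2))
    return squared_error / (2 * sigma**2) - constant

rng = np.random.default_rng(0)
sigma = 0.5                                    # assumed known noise level
X = rng.normal(size=(40, 2))                   # synthetic examples
y = X @ np.array([1.5, -0.5]) + sigma * rng.normal(size=40)

theta_ls, *_ = np.linalg.lstsq(X, y, rcond=None)   # least-squares solution
theta_other = theta_ls + np.array([0.1, -0.1])     # a nearby alternative

print(gaussian_nll(theta_ls, X, y, sigma), gaussian_nll(theta_other, X, y, sigma))
```

Because the constant term does not depend on the parameters, any parameter
vector other than the least-squares solution yields a larger negative
log-likelihood.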


It turns out that for Gaussian likelihoods the resulting optimization
problem corresponding to maximum likelihood estimation has a closed-form
solution. We will see more details on this in Chapter 9. Figure 8.5 shows
a regression dataset and the function that is induced by the
maximum-likelihood parameters. Maximum likelihood estimation may suffer
from overfitting (Section 8.3.3), analogous to unregularized empirical
risk minimization (Section 8.2.3). For other likelihood functions, i.e.,
if we model our noise with non-Gaussian distributions, maximum likelihood
estimation may not have a closed-form analytic solution. In this case, we
resort to numerical optimization methods discussed in Chapter 7.

Figure 8.6 Predictions with the maximum likelihood estimate (MLE) and the
MAP estimate. In this example, the bias that moves the intercept closer to
zero actually increases the slope.


8.3.2 Maximum A Posteriori Estimation 


If we have prior knowledge about the distribution of the parameters
$\boldsymbol{\theta}$, we can multiply the likelihood by an additional
term. This additional term is a prior probability distribution on
parameters $p(\boldsymbol{\theta})$. For a given prior, after observing
some data $\boldsymbol{x}$, how should we update the distribution of
$\boldsymbol{\theta}$? In other words, how should we represent the fact
that we have more specific knowledge of $\boldsymbol{\theta}$ after
observing data $\boldsymbol{x}$? Bayes' theorem, as discussed in
Section 6.3, gives us a principled tool to update our probability
distributions of random variables. It allows us to compute a posterior
distribution $p(\boldsymbol{\theta} \mid \boldsymbol{x})$ (the more
specific knowledge) on the parameters $\boldsymbol{\theta}$ from general
prior statements (prior distribution) $p(\boldsymbol{\theta})$ and the
function $p(\boldsymbol{x} \mid \boldsymbol{\theta})$ that links the
parameters $\boldsymbol{\theta}$ and the observed data $\boldsymbol{x}$
(called the likelihood):


p(\boldsymbol{\theta} \mid \boldsymbol{x}) = \frac{p(\boldsymbol{x} \mid \boldsymbol{\theta}) \, p(\boldsymbol{\theta})}{p(\boldsymbol{x})} .    (8.19)


Recall that we are interested in finding the parameter
$\boldsymbol{\theta}$ that maximizes the posterior. Since the distribution
$p(\boldsymbol{x})$ does not depend on $\boldsymbol{\theta}$, we can ignore
the value of the denominator for the optimization and obtain


p(\boldsymbol{\theta} \mid \boldsymbol{x}) \propto p(\boldsymbol{x} \mid \boldsymbol{\theta}) \, p(\boldsymbol{\theta}) .    (8.20)


The preceding proportionality relation hides the density of the data
$p(\boldsymbol{x})$, which may be difficult to estimate. Instead of
estimating the minimum of the negative log-likelihood, we now estimate
the minimum of the negative log-posterior, which is referred to as
maximum a posteriori estimation (MAP estimation). An illustration of the
effect of adding a zero-mean Gaussian prior is shown in Figure 8.6.


Example 8.6 

In addition to the assumption of Gaussian likelihood in the previous
example, we assume that the parameter vector is distributed as a multivariate
Gaussian with zero mean, i.e., p(θ) = N(0, Σ), where Σ is the covariance
matrix (Section 6.5). Note that the conjugate prior of a Gaussian
is also a Gaussian (Section 6.6.1), and therefore we expect the posterior
distribution to also be a Gaussian. We will see the details of maximum a
posteriori estimation in Chapter 9.
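
To make the effect of the prior concrete, here is a minimal sketch under simplifying assumptions that are not part of the example above: a scalar model y = θx + noise with noise variance `noise_var`, and a scalar prior θ ~ N(0, `prior_var`). Minimizing the negative log-posterior then adds the ratio `noise_var / prior_var` to the denominator of the maximum likelihood solution. The function names and the data are hypothetical.

```python
def theta_ml(xs, ys):
    """Maximum likelihood estimate for y = theta * x + Gaussian noise."""
    return sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)


def theta_map(xs, ys, noise_var, prior_var):
    """MAP estimate under the zero-mean Gaussian prior N(0, prior_var).

    Minimizes sum((y - theta * x)^2) / (2 * noise_var) + theta^2 / (2 * prior_var).
    """
    ratio = noise_var / prior_var
    return sum(x * y for x, y in zip(xs, ys)) / (sum(x * x for x in xs) + ratio)


# Hypothetical toy data, roughly following y = 2x.
xs = [0.5, 1.0, 1.5, 2.0]
ys = [1.1, 1.9, 3.2, 3.9]

ml = theta_ml(xs, ys)
map_strong = theta_map(xs, ys, noise_var=1.0, prior_var=0.1)    # narrow prior
map_weak = theta_map(xs, ys, noise_var=1.0, prior_var=100.0)    # broad prior
```

A narrow prior pulls the estimate toward zero, while a broad prior leaves it close to the maximum likelihood estimate.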


The idea of including prior knowledge about where good parameters 
lie is widespread in machine learning. An alternative view, which we saw 
in Section 8.2.3, is the idea of regularization, which introduces an addi- 
tional term that biases the resulting parameters to be close to the origin. 
Maximum a posteriori estimation can be considered to bridge the non- 
probabilistic and probabilistic worlds as it explicitly acknowledges the 
need for a prior distribution but it still only produces a point estimate 
of the parameters. 


Figure 8.7 Model fitting. In a parametrized class M_θ of models, we
optimize the model parameters θ to minimize the distance to the true
(unknown) model M*.


Remark. The maximum likelihood estimate θ_ML possesses the following
properties (Lehmann and Casella, 1998; Efron and Hastie, 2016):


= Asymptotic consistency: The MLE converges to the true value in the
limit of infinitely many observations, plus a random error that is
approximately normal.

= The size of the samples necessary to achieve these properties can be
quite large.

= The error’s variance decays in 1/N, where N is the number of data
points.

= Especially, in the “small” data regime, maximum likelihood estimation
can lead to overfitting.


♢
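
The 1/N decay of the error's variance can be checked empirically. The following is a small simulation sketch, assuming Gaussian data with known variance, for which the MLE of the mean is the sample mean. The function names, the seed, and the chosen sample sizes are illustrative assumptions.

```python
import random
import statistics


def mle_mean(samples):
    """For Gaussian data with known variance, the MLE of the mean is the sample mean."""
    return sum(samples) / len(samples)


def variance_of_mle(n_points, n_repeats, rng, true_mean=1.0, true_std=2.0):
    """Empirical variance of the MLE over many independently drawn datasets."""
    estimates = [
        mle_mean([rng.gauss(true_mean, true_std) for _ in range(n_points)])
        for _ in range(n_repeats)
    ]
    return statistics.pvariance(estimates)


rng = random.Random(0)
var_small = variance_of_mle(n_points=10, n_repeats=1000, rng=rng)
var_large = variance_of_mle(n_points=160, n_repeats=1000, rng=rng)

# Theory for this setup: variance = true_std^2 / N, so the ratio should be near 16.
ratio = var_small / var_large
```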


The principle of maximum likelihood estimation (and maximum a posteriori
estimation) uses probabilistic modeling to reason about the uncertainty
in the data and model parameters. However, we have not yet taken
probabilistic modeling to its full extent. In this section, the resulting
training procedure still produces a point estimate of the predictor, i.e.,
training returns one single set of parameter values that represent the best
predictor. In Section 8.4, we will take the view that the parameter values
should also be treated as random variables, and instead of estimating “best”
values of that distribution, we will use the full parameter distribution when
making predictions.


8.3.3 Model Fitting 


Consider the setting where we are given a dataset, and we are interested 
in fitting a parametrized model to the data. When we talk about “fit- 
ting”, we typically mean optimizing/learning model parameters so that 
they minimize some loss function, e.g., the negative log-likelihood. With 
maximum likelihood (Section 8.3.1) and maximum a posteriori estima- 
tion (Section 8.3.2), we already discussed two commonly used algorithms 
for model fitting. 

The parametrization of the model defines a model class M_θ with which
we can operate. For example, in a linear regression setting, we may define
the relationship between inputs x and (noise-free) observations y to be
y = ax + b, where θ := {a, b} are the model parameters. In this case, the
model parameters θ describe the family of affine functions, i.e., straight
lines with slope a, which are offset from 0 by b. Assume the data comes
from a model M*, which is unknown to us. For a given training dataset,
we optimize θ so that M_θ is as close as possible to M*, where the
“closeness” is defined by the objective function we optimize (e.g., squared
loss on the training data). Figure 8.7 illustrates a setting where we have a
small model class (indicated by the circle M_θ), and the data generation
model M* lies outside the set of considered models. We begin our parameter
search at M_{θ_0}. After the optimization, i.e., when we obtain the best
possible parameters θ*, we distinguish three different cases: (i) overfitting,
(ii) underfitting, and (iii) fitting well. We will give a high-level intuition
of what these three concepts mean.

Roughly speaking, overfitting refers to the situation where the
parametrized model class is too rich to model the dataset generated by M*,
i.e., M_θ could model much more complicated datasets. For instance, if the
dataset was generated by a linear function, and we define M_θ to be the
class of seventh-order polynomials, we could model not only linear
functions, but also polynomials of degree two, three, etc. Models that
overfit typically have a large number of parameters. An observation we often
make is that the overly flexible model class M_θ uses all its modeling power
to reduce the training error. If the training data is noisy, it will therefore
treat part of the noise as if it were useful signal. This will cause enormous
problems when we predict away from the training data. Figure 8.8(a) gives an
example of overfitting in the context of regression where the model
parameters are learned by means of maximum likelihood (see Section 8.3.1).
We will discuss overfitting in regression more in Section 9.2.2.

When we run into underfitting, we encounter the opposite problem
where the model class M_θ is not rich enough. For example, if our dataset
was generated by a sinusoidal function, but θ only parametrizes straight
lines, the best optimization procedure will not get us close to the true
model. However, we still optimize the parameters and find the best straight
line that models the dataset. Figure 8.8(b) shows an example of a model
that underfits because it is insufficiently flexible. Models that underfit
typically have few parameters.

The third case is when the parametrized model class is about right. 
Then, our model fits well, i.e., it neither overfits nor underfits. This means 
our model class is just rich enough to describe the dataset we are given. 
Figure 8.8(c) shows a model that fits the given dataset fairly well. Ideally,
this is the model class we would want to work with since it has good
generalization properties. 

In practice, we often define very rich model classes M_θ with many
parameters, such as deep neural networks. To mitigate the problem of
overfitting, we can use regularization (Section 8.2.3) or priors
(Section 8.3.2). We will discuss how to choose the model class in Section 8.6.
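
The three cases can be reproduced on a toy problem. The sketch below assumes a linear data-generating model and fits polynomials of degree 0 (too poor), 1 (about right), and 7 (too rich) by least squares via the normal equations. To keep it reproducible without random numbers, the "noise" is a deterministic alternating perturbation, which is an assumption made purely for illustration. All names are hypothetical.

```python
def solve(A, b):
    """Solve A z = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [rhs] for row, rhs in zip(A, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    z = [0.0] * n
    for r in reversed(range(n)):
        z[r] = (M[r][n] - sum(M[r][c] * z[c] for c in range(r + 1, n))) / M[r][r]
    return z


def fit_polynomial(xs, ys, degree):
    """Least-squares coefficients (lowest order first) via the normal equations."""
    feats = [[x ** k for k in range(degree + 1)] for x in xs]
    A = [[sum(f[i] * f[j] for f in feats) for j in range(degree + 1)]
         for i in range(degree + 1)]
    b = [sum(f[i] * y for f, y in zip(feats, ys)) for i in range(degree + 1)]
    return solve(A, b)


def mse(coeffs, xs, ys):
    """Mean squared error of the polynomial with the given coefficients."""
    predict = lambda x: sum(c * x ** k for k, c in enumerate(coeffs))
    return sum((predict(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


def true_model(x):
    return 2.0 * x  # the (normally unknown) data-generating model M*


train_x = [-1.0 + 2.0 * i / 7 for i in range(8)]
train_y = [true_model(x) + 0.3 * (-1) ** i for i, x in enumerate(train_x)]
test_x = [-1.0 + 2.0 * i / 200 for i in range(201)]
test_y = [true_model(x) for x in test_x]

train_error, test_error = {}, {}
for degree in (0, 1, 7):
    coeffs = fit_polynomial(train_x, train_y, degree)
    train_error[degree] = mse(coeffs, train_x, train_y)
    test_error[degree] = mse(coeffs, test_x, test_y)
```

The training error shrinks as the model class grows, and the degree-7 polynomial interpolates the training points. On the held-out points, the straight line gives the smallest error of the three, while the constant underfits and the degree-7 polynomial overfits.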


8.3.4 Further Reading 


When considering probabilistic models, the principle of maximum likelihood
estimation generalizes the idea of least-squares regression for linear
models, which we will discuss in detail in Chapter 9. When restricting
the predictor to have linear form with an additional nonlinear function φ
applied to the output, i.e.,


$$p(y_n \mid x_n, \theta) = \varphi(\theta^\top x_n), \tag{8.21}$$


we can consider other models for other prediction tasks, such as binary
classification or modeling count data (McCullagh and Nelder, 1989). An
alternative view of this is to consider likelihoods that are from the
exponential family (Section 6.6). The class of models that have a linear
dependence between parameters and data, followed by a potentially nonlinear
transformation φ (called a link function), is referred to as generalized
linear models (Agresti, 2002, chapter 4).
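
As an illustration of (8.21), the sketch below writes down two likelihoods of this form. The specific choices are assumptions that go beyond the text: a logistic sigmoid for binary classification (with φ giving the probability of y = 1) and an exponential map to the rate of a Poisson distribution for count data. The function names and the numbers are hypothetical.

```python
import math


def linear_predictor(theta, x):
    """The linear part theta^T x shared by both models."""
    return sum(t * xi for t, xi in zip(theta, x))


def bernoulli_likelihood(y, x, theta):
    """Binary classification: p(y = 1 | x, theta) = sigmoid(theta^T x)."""
    p_one = 1.0 / (1.0 + math.exp(-linear_predictor(theta, x)))
    return p_one if y == 1 else 1.0 - p_one


def poisson_likelihood(y, x, theta):
    """Count data: y ~ Poisson(rate) with rate = exp(theta^T x)."""
    rate = math.exp(linear_predictor(theta, x))
    return rate ** y * math.exp(-rate) / math.factorial(y)


theta = [0.5, -1.0]   # hypothetical parameter values
x = [2.0, 0.5]        # hypothetical input
p_positive = bernoulli_likelihood(1, x, theta)
p_three_counts = poisson_likelihood(3, x, theta)
```

In both cases the parameters enter only through θᵀx, so maximum likelihood estimation differs only in the nonlinearity and the noise model.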

Maximum likelihood estimation has a rich history; it was developed and
popularized by Sir Ronald Fisher between 1912 and 1922. We will expand upon
the idea of a probabilistic model in Section 8.4. One debate among
researchers who use probabilistic models is the one between Bayesian and
frequentist statistics. As mentioned in Section 6.1.1, it boils down to the
definition of probability. Recall from Section 6.1 that one can consider
probability to be a generalization (by allowing uncertainty) of logical
reasoning (Cheeseman, 1985; Jaynes, 2003). The method of maximum
likelihood estimation is frequentist in nature, and the interested reader is
pointed to Efron and Hastie (2016) for a balanced view of both Bayesian
and frequentist statistics.

There are some probabilistic models where maximum likelihood esti- 
mation may not be possible. The reader is referred to more advanced sta- 
tistical textbooks, e.g., Casella and Berger (2002), for approaches, such as 
method of moments, M-estimation, and estimating equations. 


8.4 Probabilistic Modeling and Inference 


In machine learning, we are frequently concerned with the interpretation
and analysis of data, e.g., for prediction of future events and decision
making. To make this task more tractable, we often build models that
describe the generative process that generates the observed data.

For example, we can describe the outcome of a coin-flip experiment
(“heads” or “tails”) in two steps. First, we define a parameter μ, which
describes the probability of “heads” as the parameter of a Bernoulli
distribution (Chapter 6); second, we can sample an outcome x ∈ {head, tail}
from the Bernoulli distribution p(x | μ) = Ber(μ). The parameter μ gives
rise to a specific dataset X and depends on the coin used. Since μ is
unknown in advance and can never be observed directly, we need mechanisms
to learn something about μ given observed outcomes of coin-flip
experiments. In the following, we will discuss how probabilistic modeling
can be used for this purpose.
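
The two-step generative process can be written down directly. This is a minimal sketch in which the value of μ, the seed, and the function name `flip_coins` are illustrative assumptions. In a real experiment μ is hidden and only the outcomes are observed.

```python
import random


def flip_coins(mu, n, rng):
    """Sample n outcomes from Ber(mu); mu is the probability of 'head'."""
    return ["head" if rng.random() < mu else "tail" for _ in range(n)]


rng = random.Random(0)
mu = 0.7                          # step 1: fix the parameter of the coin
data = flip_coins(mu, 1000, rng)  # step 2: sample a dataset from p(x | mu)

# From the outcomes alone we can learn something about mu,
# e.g., via the fraction of heads.
mu_hat = data.count("head") / len(data)
```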


8.4.1 Probabilistic Models 


Probabilistic models represent the uncertain aspects of an experiment as 
probability distributions. The benefit of using probabilistic models is that 
they offer a unified and consistent set of tools from probability theory 
(Chapter 6) for modeling, inference, prediction, and model selection. 

In probabilistic modeling, the joint distribution p(x, θ) of the observed
variables x and the hidden parameters θ is of central importance: It
encapsulates information from the following:


= The prior and the likelihood (product rule, Section 6.3).

= The marginal likelihood p(x), which will play an important role in
model selection (Section 8.6), can be computed by taking the joint
distribution and integrating out the parameters (sum rule, Section 6.3).

= The posterior, which can be obtained by dividing the joint by the marginal
likelihood.


Only the joint distribution has this property. Therefore, a probabilistic 
model is specified by the joint distribution of all its random variables. 
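
For a model with finitely many parameter values, all of these quantities can be read off a table of the joint distribution. The sketch below assumes a coin whose bias θ can take only three values, with a hypothetical prior; integrals then become sums.

```python
thetas = [0.2, 0.5, 0.8]                 # hypothetical candidate values of theta
prior = {0.2: 0.3, 0.5: 0.4, 0.8: 0.3}   # hypothetical prior p(theta)

# Joint p(x, theta) = p(x | theta) p(theta) for x in {0, 1} (product rule).
joint = {
    (x, t): (t if x == 1 else 1.0 - t) * prior[t]
    for x in (0, 1)
    for t in thetas
}


def marginal_likelihood(x):
    """p(x): sum the joint over all parameter values (sum rule)."""
    return sum(joint[(x, t)] for t in thetas)


def posterior(x):
    """p(theta | x): joint divided by the marginal likelihood."""
    evidence = marginal_likelihood(x)
    return {t: joint[(x, t)] / evidence for t in thetas}


def prior_from_joint(t):
    """p(theta): sum the joint over all observations."""
    return sum(joint[(x, t)] for x in (0, 1))


posterior_after_head = posterior(1)
```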


8.4.2 Bayesian Inference 


A key task in machine learning is to take a model and the data to uncover
the values of the model’s hidden variables θ given the observed variables
x. In Section 8.3, we already discussed two ways for estimating model
parameters θ using maximum likelihood or maximum a posteriori
estimation. In both cases, we obtain a single-best value for θ so that the
key algorithmic problem of parameter estimation is solving an optimization
problem. Once these point estimates θ* are known, we use them to make
predictions. More specifically, the predictive distribution will be
p(x | θ*), where we use θ* in the likelihood function.

As discussed in Section 6.3, focusing solely on some statistic of the
posterior distribution (such as the parameter θ* that maximizes the
posterior) leads to loss of information, which can be critical in a system
that uses the prediction p(x | θ*) to make decisions. These decision-making
systems typically have objective functions that differ from the likelihood,
e.g., a squared-error loss or a misclassification error. Therefore, having
the full posterior distribution around can be extremely useful and leads to
more robust decisions. Bayesian inference is about finding this posterior
distribution (Gelman et al., 2004). For a dataset X, a parameter prior p(θ),
and a likelihood function, the posterior


$$p(\theta \mid X) = \frac{p(X \mid \theta)\, p(\theta)}{p(X)}, \qquad p(X) = \int p(X \mid \theta)\, p(\theta)\, \mathrm{d}\theta, \tag{8.22}$$


is obtained by applying Bayes’ theorem. The key idea is to exploit Bayes’
theorem to invert the relationship between the parameters θ and the data
X (given by the likelihood) to obtain the posterior distribution p(θ | X).

The implication of having a posterior distribution on the parameters is
that it can be used to propagate uncertainty from the parameters to the
data. More specifically, with a distribution p(θ) on the parameters our
predictions will be


$$p(x) = \int p(x \mid \theta)\, p(\theta)\, \mathrm{d}\theta = \mathbb{E}_{\theta}[p(x \mid \theta)], \tag{8.23}$$


and they no longer depend on the model parameters θ, which have been
marginalized/integrated out. Equation (8.23) reveals that the prediction
is an average over all plausible parameter values θ, where the plausibility
is encapsulated by the parameter distribution p(θ).
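
The difference between averaging over parameters as in (8.23) and plugging in a point estimate shows up clearly with little data. The sketch below assumes a coin-flip model in which three heads and no tails have been observed, with a uniform Beta(1, 1) prior on the heads probability, so that the parameter distribution is Beta(4, 1) by conjugacy. The integral is approximated with a simple midpoint rule. All names are hypothetical.

```python
import math


def beta_pdf(mu, a, b):
    """Density of the Beta(a, b) distribution at mu in (0, 1)."""
    log_norm = math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
    return math.exp(log_norm + (a - 1) * math.log(mu) + (b - 1) * math.log(1 - mu))


def predictive_heads(a, b, grid_size=10_000):
    """Midpoint-rule approximation of the integral of p(x = head | mu) p(mu) d mu."""
    width = 1.0 / grid_size
    total = 0.0
    for i in range(grid_size):
        mu = (i + 0.5) * width
        total += mu * beta_pdf(mu, a, b) * width
    return total


heads, tails = 3, 0
a_post, b_post = 1 + heads, 1 + tails            # Beta(1, 1) prior updated by the data
bayes_prediction = predictive_heads(a_post, b_post)
plug_in_prediction = heads / (heads + tails)     # maximum likelihood point estimate
```

The plug-in prediction claims that tails is impossible after three heads, whereas the averaged prediction (about 0.8 for heads) still reflects the remaining uncertainty about the parameter.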

Having discussed parameter estimation in Section 8.3 and Bayesian
inference here, let us compare these two approaches to learning. Parameter
estimation via maximum likelihood or MAP estimation yields a consistent
point estimate θ* of the parameters, and the key computational problem
to be solved is optimization. In contrast, Bayesian inference yields a
(posterior) distribution, and the key computational problem to be solved is
integration. Predictions with point estimates are straightforward, whereas
predictions in the Bayesian framework require solving another integration
problem; see (8.23). However, Bayesian inference gives us a principled
way to incorporate prior knowledge, account for side information, and
incorporate structural knowledge, none of which is easily done in the
context of parameter estimation. Moreover, the propagation of parameter
uncertainty to the prediction can be valuable in decision-making systems
for risk assessment and exploration in the context of data-efficient
learning (Deisenroth et al., 2015; Kamthe and Deisenroth, 2018).

Draft (2021-01-14) of “Mathematics for Machine Learning”. Feedback: https://mml-book.com.

While Bayesian inference is a mathematically principled framework for
learning about parameters and making predictions, there are some
practical challenges that come with it because of the integration problems
we need to solve; see (8.22) and (8.23). More specifically, if we do not
choose a conjugate prior on the parameters (Section 6.6.1), the integrals in
(8.22) and (8.23) are not analytically tractable, and we cannot compute the
posterior, the predictions, or the marginal likelihood in closed form. In
these cases, we need to resort to approximations. Here, we can use
stochastic approximations, such as Markov chain Monte Carlo (MCMC) (Gilks
et al., 1996), or deterministic approximations, such as the Laplace
approximation (Bishop, 2006; Barber, 2012; Murphy, 2012), variational
inference (Jordan et al., 1999; Blei et al., 2017), or expectation
propagation (Minka, 2001a).
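
The simplest stochastic approximation is plain Monte Carlo integration, which is not MCMC but shares the idea of replacing an integral by an average over samples. The sketch below estimates the marginal likelihood in (8.22) for a coin-flip dataset by averaging the likelihood over draws from a uniform prior. The uniform prior is deliberately chosen so that an exact value is available for comparison; the names and the sample size are assumptions.

```python
import random


def likelihood(heads, tails, mu):
    """p(X | mu) for a dataset with the given numbers of heads and tails."""
    return mu ** heads * (1.0 - mu) ** tails


def monte_carlo_evidence(heads, tails, n_samples, rng):
    """Estimate p(X) = E_{mu ~ p(mu)}[p(X | mu)] with samples from a uniform prior."""
    total = sum(likelihood(heads, tails, rng.random()) for _ in range(n_samples))
    return total / n_samples


rng = random.Random(0)
estimate = monte_carlo_evidence(heads=3, tails=1, n_samples=50_000, rng=rng)

# Exact value for comparison: the integral of mu^3 (1 - mu) over [0, 1] is 1/20.
exact = 1.0 / 20.0
```

The same averaging works when no closed form exists, at the cost of a sampling error that shrinks only with the square root of the number of samples.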

Despite these challenges, Bayesian inference has been successfully ap- 
plied to a variety of problems, including large-scale topic modeling (Hoff- 
man et al., 2013), click-through-rate prediction (Graepel et al., 2010), 
data-efficient reinforcement learning in control systems (Deisenroth et al., 
2015), online ranking systems (Herbrich et al., 2007), and large-scale rec- 
ommender systems. There are generic tools, such as Bayesian optimiza- 
tion (Brochu et al., 2009; Snoek et al., 2012; Shahriari et al., 2016), that 
are very useful ingredients for an efficient search of meta parameters of 
models or algorithms. 


Remark. In the machine learning literature, there can be a somewhat ar- 
bitrary separation between (random) “variables” and “parameters”. While 
parameters are estimated (e.g., via maximum likelihood), variables are 
usually marginalized out. In this book, we are not so strict with this sep- 
aration because, in principle, we can place a prior on any parameter and 
integrate it out, which would then turn the parameter into a random
variable according to the aforementioned separation. ♢


8.4.3 Latent-Variable Models 


In practice, it is sometimes useful to have additional latent variables z
(besides the model parameters θ) as part of the model (Moustaki et al.,
2015). These latent variables are different from the model parameters
θ as they do not parametrize the model explicitly. Latent variables may
describe the data-generating process, thereby contributing to the
interpretability of the model. They also often simplify the structure of the
model and allow us to define simpler and richer model structures.
Simplification of the model structure often goes hand in hand with a smaller
number of model parameters (Paquet, 2008; Murphy, 2012). Learning in
latent-variable models (at least via maximum likelihood) can be done in a
principled way using the expectation maximization (EM) algorithm
(Dempster et al., 1977; Bishop, 2006). Examples, where such latent variables
are helpful, are principal component analysis for dimensionality reduction
(Chapter 10), Gaussian mixture models for density estimation (Chapter 11),
hidden Markov models (Maybeck, 1979) or dynamical systems
(Ghahramani and Roweis, 1999; Ljung, 1999) for time-series modeling,
and meta learning and task generalization (Hausman et al., 2018;
Sæmundsson et al., 2018). Although the introduction of these latent variables
may make the model structure and the generative process easier, learning
in latent-variable models is generally hard, as we will see in Chapter 11.

Since latent-variable models also allow us to define the process that
generates data from parameters, let us have a look at this generative
process. Denoting data by x, the model parameters by θ and the latent
variables by z, we obtain the conditional distribution


$$p(x \mid z, \theta) \tag{8.24}$$


that allows us to generate data for any model parameters and latent
variables. Given that z are latent variables, we place a prior p(z) on them.

Like the models we discussed previously, models with latent variables
can be used for parameter learning and inference within the frameworks
we discussed in Sections 8.3 and 8.4.2. To facilitate learning (e.g., by
means of maximum likelihood estimation or Bayesian inference), we
follow a two-step procedure. First, we compute the likelihood p(x | θ) of the
model, which does not depend on the latent variables. Second, we use this
likelihood for parameter estimation or Bayesian inference, where we use
exactly the same expressions as in Sections 8.3 and 8.4.2, respectively.

Since the likelihood function p(x | θ) is the predictive distribution of the
data given the model parameters, we need to marginalize out the latent
variables so that


$$p(x \mid \theta) = \int p(x \mid z, \theta)\, p(z)\, \mathrm{d}z, \tag{8.25}$$


where p(x | z, θ) is given in (8.24) and p(z) is the prior on the latent
variables. Note that the likelihood must not depend on the latent variables
z, but it is only a function of the data x and the model parameters θ.
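
When the latent variable is discrete, the integral in (8.25) becomes a sum. The sketch below assumes a two-component Gaussian mixture in one dimension, where z selects the component; the prior on z and the component parameters are hypothetical numbers chosen for illustration.

```python
import math


def normal_pdf(x, mean, std):
    """Density of a univariate Gaussian."""
    return math.exp(-0.5 * ((x - mean) / std) ** 2) / (std * math.sqrt(2.0 * math.pi))


PRIOR_Z = [0.3, 0.7]                                  # hypothetical prior p(z)
THETA = {"means": [-2.0, 1.0], "stds": [0.5, 1.0]}    # hypothetical parameters


def conditional(x, z, theta):
    """p(x | z, theta): the Gaussian component selected by z."""
    return normal_pdf(x, theta["means"][z], theta["stds"][z])


def likelihood(x, theta):
    """p(x | theta): the latent variable z is summed out."""
    return sum(conditional(x, z, theta) * PRIOR_Z[z] for z in range(len(PRIOR_Z)))


value_at_zero = likelihood(0.0, THETA)
```

The result is a function of x and θ only, as required, and it is again a normalized density in x.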

The likelihood in (8.25) directly allows for parameter estimation via
maximum likelihood. MAP estimation is also straightforward with an
additional prior on the model parameters θ as discussed in Section 8.3.2.
Moreover, with the likelihood (8.25) Bayesian inference (Section 8.4.2)
in a latent-variable model works in the usual way: We place a prior p(θ)
on the model parameters and use Bayes’ theorem to obtain a posterior
distribution


$$p(\theta \mid X) = \frac{p(X \mid \theta)\, p(\theta)}{p(X)} \tag{8.26}$$


over the model parameters given a dataset X. The posterior in (8.26) can
be used for predictions within a Bayesian inference framework; see (8.23).

One challenge we have in this latent-variable model is that the
likelihood p(X | θ) requires the marginalization of the latent variables
according to (8.25). Except when we choose a conjugate prior p(z) for
p(x | z, θ), the marginalization in (8.25) is not analytically tractable, and
we need to resort to approximations (Bishop, 2006; Paquet, 2008;
Murphy, 2012; Moustaki et al., 2015).

Similar to the parameter posterior (8.26) we can compute a posterior
on the latent variables according to


$$p(z \mid X) = \frac{p(X \mid z)\, p(z)}{p(X)}, \qquad p(X \mid z) = \int p(X \mid z, \theta)\, p(\theta)\, \mathrm{d}\theta, \tag{8.27}$$


where p(z) is the prior on the latent variables and p(X | z) requires us to
integrate out the model parameters θ.

Given the difficulty of solving integrals analytically, it is clear that
marginalizing out both the latent variables and the model parameters at the
same time is not possible in general (Bishop, 2006; Murphy, 2012). A
quantity that is easier to compute is the posterior distribution on the latent
variables, but conditioned on the model parameters, i.e.,


$$p(z \mid X, \theta) = \frac{p(X \mid z, \theta)\, p(z)}{p(X \mid \theta)}, \tag{8.28}$$


where p(z) is the prior on the latent variables and p(X | z, θ) is given
in (8.24).
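
For a discrete latent variable and a single observation, the posterior (8.28) is a short computation. The sketch below again assumes a two-component Gaussian mixture with hypothetical parameters; the function `latent_posterior` is an illustrative name, not an API from the text.

```python
import math


def normal_pdf(x, mean, std):
    """Density of a univariate Gaussian."""
    return math.exp(-0.5 * ((x - mean) / std) ** 2) / (std * math.sqrt(2.0 * math.pi))


def latent_posterior(x, means, stds, prior_z):
    """p(z | x, theta) for each value of z, via Bayes' theorem.

    The numerator is p(x | z, theta) p(z); the denominator p(x | theta)
    is the sum of the numerators over all values of z.
    """
    joint = [normal_pdf(x, m, s) * p for m, s, p in zip(means, stds, prior_z)]
    evidence = sum(joint)
    return [j / evidence for j in joint]


# Hypothetical model: component 0 centered at -2, component 1 centered at 1.
means, stds, prior_z = [-2.0, 1.0], [0.5, 1.0], [0.3, 0.7]
post_near_first = latent_posterior(-2.0, means, stds, prior_z)
post_near_second = latent_posterior(1.0, means, stds, prior_z)
```

An observation near the center of a component is assigned to that component with high posterior probability, even when the prior favors the other one.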

In Chapters 10 and 11, we derive the likelihood functions for PCA and 
Gaussian mixture models, respectively. Moreover, we compute the poste- 
rior distributions (8.28) on the latent variables for both PCA and Gaussian 
mixture models. 


Remark. In the following chapters, we may not be drawing such a clear
distinction between latent variables z and uncertain model parameters θ
and call the model parameters “latent” or “hidden” as well because they
are unobserved. In Chapters 10 and 11, where we use the latent variables
z, we will pay attention to the difference as we will have two different
types of hidden variables: model parameters θ and latent variables z. ♢


We can exploit the fact that all the elements of a probabilistic model are 
random variables to define a unified language for representing them. In 
Section 8.5, we will see a concise graphical language for representing the 
structure of probabilistic models. We will use this graphical language to 
describe the probabilistic models in the subsequent chapters. 


8.4.4 Further Reading 


Probabilistic models in machine learning (Bishop, 2006; Barber, 2012; 
Murphy, 2012) provide a way for users to capture uncertainty about data 
and predictive models in a principled fashion. Ghahramani (2015) presents 
a short review of probabilistic models in machine learning. Given a proba- 
bilistic model, we may be lucky enough to be able to compute parameters 
of interest analytically. However, in general, analytic solutions are rare, 
and computational methods such as sampling (Gilks et al., 1996; Brooks 
et al., 2011) and variational inference (Jordan et al., 1999; Blei et al.,
2017) are used. Moustaki et al. (2015) and Paquet (2008) provide a good
overview of Bayesian inference in latent-variable models. 

In recent years, several programming languages have been proposed 
that aim to treat the variables defined in software as random variables 
corresponding to probability distributions. The objective is to be able to 
write complex functions of probability distributions, while under the hood 
the compiler automatically takes care of the rules of Bayesian inference. 
This rapidly changing field is called probabilistic programming. 


8.5 Directed Graphical Models 


In this section, we introduce a graphical language for specifying a prob- 
abilistic model, called the directed graphical model. It provides a compact 
and succinct way to specify probabilistic models, and allows the reader to 
visually parse dependencies between random variables. A graphical model 
visually captures the way in which the joint distribution over all random 
variables can be decomposed into a product of factors depending only on 
a subset of these variables. In Section 8.4, we identified the joint distri- 
bution of a probabilistic model as the key quantity of interest because it 
comprises information about the prior, the likelihood, and the posterior. 
However, the joint distribution by itself can be quite complicated, and 
it does not tell us anything about structural properties of the probabilis- 
tic model. For example, the joint distribution p(a, b,c) does not tell us 
anything about independence relations. This is the point where graphical 
models come into play. This section relies on the concepts of independence 
and conditional independence, as described in Section 6.4.5. 

In a graphical model, nodes are random variables. In Figure 8.9(a), the 
nodes represent the random variables a, b, c. Edges represent probabilistic 
relations between variables, e.g., conditional probabilities. 


Remark. Not every distribution can be represented in a particular choice of
graphical model. A discussion of this can be found in Bishop (2006). ♢


Probabilistic graphical models have some convenient properties: 


= They are a simple way to visualize the structure of a probabilistic model. 

= They can be used to design or motivate new kinds of statistical models. 

= Inspection of the graph alone gives us insight into properties, e.g., con- 
ditional independence. 

= Complex computations for inference and learning in statistical models 
can be expressed in terms of graphical manipulations. 


8.5.1 Graph Semantics 


Directed graphical models/Bayesian networks are a method for representing
conditional dependencies in a probabilistic model. They provide a visual
description of the conditional probabilities, hence, providing a simple
language for describing complex interdependence. The modular description
also entails computational simplification. Directed links (arrows) between
two nodes (random variables) indicate conditional probabilities. For
example, the arrow between a and b in Figure 8.9(a) gives the conditional
probability p(b | a) of b given a.


Figure 8.9 (a) Fully connected. (b) Not fully connected.


Directed graphical models can be derived from joint distributions if we 
know something about their factorization. 


Example 8.7 
Consider the joint distribution 


$$p(a, b, c) = p(c \mid a, b)\, p(b \mid a)\, p(a) \tag{8.29}$$


of three random variables a, b, c. The factorization of the joint distribution 
in (8.29) tells us something about the relationship between the random 
variables: 


= c depends directly on a and b. 
= b depends directly on a. 
= a depends neither on b nor on c. 


For the factorization in (8.29), we obtain the directed graphical model in 
Figure 8.9(a). 


In general, we can construct the corresponding directed graphical model 
from a factorized joint distribution as follows: 


1. Create a node for all random variables. 

2. For each conditional distribution, we add a directed link (arrow) to 
the graph from the nodes corresponding to the variables on which the 
distribution is conditioned. 


The graph layout depends on the choice of factorization of the joint dis- 
tribution. 
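
The two construction steps translate almost literally into code. In this sketch a factorization is represented as a dictionary that maps each variable to the variables its factor is conditioned on; this representation and the function name are assumptions made for illustration.

```python
def edges_from_factorization(conditioned_on):
    """Build nodes and directed edges from a factorized joint distribution.

    conditioned_on maps each variable to the variables on which its
    conditional distribution is conditioned.
    """
    nodes = sorted(conditioned_on)           # step 1: one node per random variable
    edges = sorted(                          # step 2: arrows from parents to child
        (parent, child)
        for child, parents in conditioned_on.items()
        for parent in parents
    )
    return nodes, edges


# The factorization p(c | a, b) p(b | a) p(a) from (8.29).
factorization = {"a": [], "b": ["a"], "c": ["a", "b"]}
nodes, edges = edges_from_factorization(factorization)
```

For (8.29) this yields the arrows a to b, a to c, and b to c, i.e., a fully connected graph on three nodes.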

We discussed how to get from a known factorization of the joint
distribution to the corresponding directed graphical model. Now, we will do
exactly the opposite and describe how to extract the joint distribution of
a set of random variables from a given graphical model.


Example 8.8 
Looking at the graphical model in Figure 8.9(b), we exploit two proper- 
ties: 


= The joint distribution p(x_1, …, x_5) we seek is the product of a set of
conditionals, one for each node in the graph. In this particular example,
we will need five conditionals.

= Each conditional depends only on the parents of the corresponding
node in the graph. For example, x_4 will be conditioned on x_2.


These two properties yield the desired factorization of the joint
distribution


$$p(x_1, x_2, x_3, x_4, x_5) = p(x_1)\, p(x_5)\, p(x_2 \mid x_5)\, p(x_3 \mid x_1, x_2)\, p(x_4 \mid x_2). \tag{8.30}$$


In general, the joint distribution p(x) = p(x_1, …, x_K) is given as

$$p(x) = \prod_{k=1}^{K} p(x_k \mid \mathrm{Pa}_k), \tag{8.31}$$


where Pa_k means “the parent nodes of x_k”. Parent nodes of x_k are nodes
that have arrows pointing to x_k.
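
The factorization (8.31) can be evaluated mechanically once each node has a conditional distribution given its parents. The sketch below uses the parent sets read off (8.30) and assumes binary variables with made-up conditional probability tables; only the graph structure comes from the text.

```python
from itertools import product

# Parent sets as in (8.30).
PARENTS = {"x1": (), "x5": (), "x2": ("x5",), "x3": ("x1", "x2"), "x4": ("x2",)}

# Hypothetical tables: P(node = 1 | parent values), keyed by the parent values.
P_ONE = {
    "x1": {(): 0.6},
    "x5": {(): 0.3},
    "x2": {(0,): 0.2, (1,): 0.9},
    "x3": {(0, 0): 0.1, (0, 1): 0.5, (1, 0): 0.4, (1, 1): 0.8},
    "x4": {(0,): 0.7, (1,): 0.25},
}


def joint(assignment):
    """Product over all nodes of p(x_k | Pa_k) for one full assignment."""
    prob = 1.0
    for node, parents in PARENTS.items():
        p_one = P_ONE[node][tuple(assignment[p] for p in parents)]
        prob *= p_one if assignment[node] == 1 else 1.0 - p_one
    return prob


nodes = list(PARENTS)
total = sum(
    joint(dict(zip(nodes, values)))
    for values in product((0, 1), repeat=len(nodes))
)
all_ones = joint({node: 1 for node in nodes})
```

Because every factor is a normalized conditional distribution, the product sums to one over all 32 joint assignments.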

We conclude this subsection with a concrete example of the coin-flip
experiment. Consider a Bernoulli experiment (Example 6.8) where the
probability that the outcome x of this experiment is “heads” is


$$p(x \mid \mu) = \mathrm{Ber}(\mu). \tag{8.32}$$


We now repeat this experiment N times and observe outcomes x_1, …, x_N
so that we obtain the joint distribution


$$p(x_1, \ldots, x_N \mid \mu) = \prod_{n=1}^{N} p(x_n \mid \mu). \tag{8.33}$$


The expression on the right-hand side is a product of Bernoulli
distributions on each individual outcome because the experiments are
independent. Recall from Section 6.4.5 that statistical independence means
that the distribution factorizes. To write the graphical model down for this
setting, we make the distinction between unobserved/latent variables and
observed variables. Graphically, observed variables are denoted by shaded
nodes so that we obtain the graphical model in Figure 8.10(a). We see
that the single parameter μ is the same for all x_n, n = 1, …, N as the
outcomes x_n are identically distributed. A more compact, but equivalent,
graphical model for this setting is given in Figure 8.10(b), where we use
the plate notation. The plate (box) repeats everything inside (in this case,
the observations x_n) N times. Therefore, both graphical models are
equivalent, but the plate notation is more compact. Graphical models
immediately allow us to place a hyperprior on μ. A hyperprior is a second
layer of prior distributions on the parameters of the first layer of priors.
Figure 8.10(c) places a Beta(α, β) prior on the latent variable μ. If we treat
α and β as deterministic parameters, i.e., not random variables, we omit
the circle around them.


Figure 8.10 (a) Version with x_n explicit. (b) Version with plate notation. (c) Hyperparameters α and β on the latent μ.
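
Because the repeated coin flips are independent given μ, their joint distribution is a plain product of Bernoulli factors. The sketch below evaluates this product for a small hypothetical dataset, encoding heads as 1 and tails as 0, and locates its maximizer on a coarse grid.

```python
def joint_likelihood(outcomes, mu):
    """p(x_1, ..., x_N | mu) as a product of Bernoulli factors; 1 encodes heads."""
    prob = 1.0
    for x in outcomes:
        prob *= mu if x == 1 else 1.0 - mu
    return prob


outcomes = [1, 1, 0, 1]                    # hypothetical observations
grid = [i / 100 for i in range(101)]       # candidate values for mu
best_mu = max(grid, key=lambda mu: joint_likelihood(outcomes, mu))
```

On this grid the maximizer coincides with the fraction of heads in the data, 0.75.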


8.5.2 Conditional Independence and d-Separation 


Directed graphical models allow us to find conditional independence
relations (Section 6.4.5) of the joint distribution just by looking
at the graph. A concept called d-separation (Pearl, 1988) is key to this.

Consider a general directed graph in which A, B, C are arbitrary
nonintersecting sets of nodes (whose union may be smaller than the complete
set of nodes in the graph). We wish to ascertain whether a particular
conditional independence statement, “A is conditionally independent of B
given C”, denoted by


$$A \perp\!\!\!\perp B \mid C, \tag{8.34}$$


is implied by a given directed acyclic graph. To do so, we consider all
possible trails (paths that ignore the direction of the arrows) from any
node in A to any node in B. Any such path is said to be blocked if it
includes any node such that either of the following is true:


= The arrows on the path meet either head to tail or tail to tail at the
node, and the node is in the set C.


= The arrows meet head to head at the node, and neither the node nor
any of its descendants is in the set C.


If all paths are blocked, then A is said to be d-separated from B by C,
and the joint distribution over all of the variables in the graph will satisfy
A ⊥⊥ B | C.


Example 8.9 (Conditional Independence)


Consider the graphical model in Figure 8.11. Visual inspection gives us


$$b \perp\!\!\!\perp d \mid a, c \tag{8.35}$$
$$a \perp\!\!\!\perp c \mid b \tag{8.36}$$
$$b \not\perp\!\!\!\perp d \mid c \tag{8.37}$$
$$a \not\perp\!\!\!\perp c \mid b, e \tag{8.38}$$


Directed graphical models allow a compact representation of proba- 
bilistic models, and we will see examples of directed graphical models in 
Chapters 9, 10, and 11. The representation, along with the concept of con- 
ditional independence, allows us to factorize the respective probabilistic 
models into expressions that are easier to optimize. 
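
Checking d-separation by eye becomes error-prone for larger graphs, so it helps to see that the test can be automated. The sketch below does not enumerate trails. Instead it uses an equivalent, well-known reformulation: restrict the graph to the nodes in A, B, C and their ancestors, connect all parents of a common child, drop the arrow directions, remove C, and test whether A can still reach B. The example graph is hypothetical. It was chosen to agree with two independence statements and two dependence statements over nodes a to e, matching the pattern of Example 8.9, and it is not claimed to be the graph in Figure 8.11.

```python
def d_separated(parents, a_set, b_set, c_set):
    """Return True if a_set is d-separated from b_set by c_set.

    parents maps every node of a directed acyclic graph to a tuple of its
    parent nodes. The three sets are assumed to be disjoint.
    """
    # 1. Keep only the nodes in A, B, C and all of their ancestors.
    keep, stack = set(), list(a_set | b_set | c_set)
    while stack:
        node = stack.pop()
        if node not in keep:
            keep.add(node)
            stack.extend(parents[node])
    # 2. Moralize: undirected parent-child links plus links between co-parents.
    neighbors = {node: set() for node in keep}
    for child in keep:
        pars = list(parents[child])
        for p in pars:
            neighbors[child].add(p)
            neighbors[p].add(child)
        for i, p in enumerate(pars):
            for q in pars[i + 1:]:
                neighbors[p].add(q)
                neighbors[q].add(p)
    # 3. Remove C and check whether any node of A still reaches a node of B.
    seen, stack = set(), [n for n in a_set if n not in c_set]
    while stack:
        node = stack.pop()
        if node in seen:
            continue
        seen.add(node)
        if node in b_set:
            return False
        stack.extend(n for n in neighbors[node] if n not in c_set and n not in seen)
    return True


# Hypothetical graph: b -> a, b -> c, a -> d, c -> d, d -> e.
graph = {"a": ("b",), "b": (), "c": ("b",), "d": ("a", "c"), "e": ("d",)}

b_indep_d_given_ac = d_separated(graph, {"b"}, {"d"}, {"a", "c"})
a_indep_c_given_b = d_separated(graph, {"a"}, {"c"}, {"b"})
b_indep_d_given_c = d_separated(graph, {"b"}, {"d"}, {"c"})
a_indep_c_given_be = d_separated(graph, {"a"}, {"c"}, {"b", "e"})
```

In this graph, conditioning on e unblocks the head-to-head node d because e is a descendant of d, which is why a and c are no longer separated once e is observed.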


The graphical representation of the probabilistic model allows us to 
visually see the impact of design choices we have made on the structure 
of the model. We often need to make high-level assumptions about the 
structure of the model. These modeling assumptions (hyperparameters) 
affect the prediction performance, but cannot be selected directly using 
the approaches we have seen so far. We will discuss different ways to 
choose the structure in Section 8.6.


8.5.3 Further Reading


An introduction to probabilistic graphical models can be found in Bishop 
(2006, chapter 8), and an extensive description of the different applica- 
tions and corresponding algorithmic implications can be found in the book 
by Koller and Friedman (2009). There are three main types of probabilistic 
graphical models: 


= Directed graphical models (Bayesian networks); see Figure 8.12(a) 
= Undirected graphical models (Markov random fields); see Figure 8.12(b)
= Factor graphs; see Figure 8.12(c) 


Graphical models allow for graph-based algorithms for inference and 
learning, e.g., via local message passing. Applications range from rank- 
ing in online games (Herbrich et al., 2007) and computer vision (e.g., 
image segmentation, semantic labeling, image denoising, image restora- 
tion (Kittler and Föglein, 1984; Sucar and Gillies, 1994; Shotton et al., 
2006; Szeliski et al., 2008)) to coding theory (McEliece et al., 1998), solv- 
ing linear equation systems (Shental et al., 2008), and iterative Bayesian 
state estimation in signal processing (Bickson et al., 2007; Deisenroth and 
Mohamed, 2012). 

One topic that is particularly important in real applications that we do 
not discuss in this book is the idea of structured prediction (Bakir et al., 
2007; Nowozin et al., 2014), which allows machine learning models to 
tackle predictions that are structured, for example sequences, trees, and 
graphs. The popularity of neural network models has allowed more flex- 
ible probabilistic models to be used, resulting in many useful applica- 
tions of structured models (Goodfellow et al., 2016, chapter 16). In recent 
years, there has been a renewed interest in graphical models due to their 
applications to causal inference (Pearl, 2009; Imbens and Rubin, 2015; 
Peters et al., 2017; Rosenbaum, 2017). 


8.6 Model Selection 


In machine learning, we often need to make high-level modeling decisions 
that critically influence the performance of the model. The choices we 
make (e.g., the functional form of the likelihood) influence the number 
and type of free parameters in the model and thereby also the flexibility 
and expressivity of the model. More complex models are more flexible in 
the sense that they can be used to describe more datasets. For instance, a 
polynomial of degree 1 (a line $y = a_0 + a_1 x$) can only be used to describe
linear relations between inputs $x$ and observations $y$. A polynomial of
degree 2 can additionally describe quadratic relationships between inputs 
and observations. 

One would now think that very flexible models are generally preferable 
to simple models because they are more expressive. A general problem
is that at training time we can only use the training set to evaluate the
performance of the model and learn its parameters. However, the per- 
formance on the training set is not really what we are interested in. In 
Section 8.3, we have seen that maximum likelihood estimation can lead 
to overfitting, especially when the training dataset is small. Ideally, our 
model (also) works well on the test set (which is not available at training 
time). Therefore, we need some mechanisms for assessing how a model 
generalizes to unseen test data. Model selection is concerned with exactly 
this problem. 


8.6.1 Nested Cross-Validation 


We have already seen an approach (cross-validation in Section 8.2.4) that 
can be used for model selection. Recall that cross-validation provides an 
estimate of the generalization error by repeatedly splitting the dataset into 
training and validation sets. We can apply this idea one more time, i.e., 
for each split, we can perform another round of cross-validation. This is 
sometimes referred to as nested cross-validation; see Figure 8.13. The inner 
level is used to estimate the performance of a particular choice of model 
or hyperparameter on an internal validation set. The outer level is used to
estimate generalization performance for the best choice of model chosen 
by the inner loop. We can test different model and hyperparameter choices 
in the inner loop. To distinguish the two levels, the set used to estimate 
the generalization performance is often called the test set and the set used 
for choosing the best model is called the validation set. The inner loop 
estimates the expected value of the generalization error for a given model 
(8.39), by approximating it using the empirical error on the validation set, 
i.e., 


$$\mathbb{E}_{V}[R(V \mid M)] \approx \frac{1}{K} \sum_{k=1}^{K} R(V^{(k)} \mid M), \tag{8.39}$$

where R(V | M) is the empirical risk (e.g., root mean square error) on the 
validation set V for model M. We repeat this procedure for all models and 
choose the model that performs best. Note that cross-validation not only 
gives us the expected generalization error, but we can also obtain high-order
statistics, e.g., the standard error, an estimate of how uncertain the
mean estimate is. Once the model is chosen, we can evaluate the final
performance on the test set.

[Figure 8.14: evidence (vertical axis) of two models, p(D | M_1) and p(D | M_2), over the space of datasets D.]
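The two levels of nested cross-validation can be sketched as follows. This is an illustrative sketch under assumptions that are not from the text: the candidate models are polynomials of different degrees fitted by least squares with `np.polyfit`, the risk R is the RMSE, the data are synthetic, and all function names are hypothetical.

```python
import numpy as np


def rmse_of_fit(x_tr, y_tr, x_va, y_va, degree):
    """Fit a polynomial on the training part and return the RMSE on the held-out part."""
    coeffs = np.polyfit(x_tr, y_tr, degree)
    resid = y_va - np.polyval(coeffs, x_va)
    return float(np.sqrt(np.mean(resid ** 2)))


def kfold_indices(n, k, rng):
    """Split a random permutation of 0..n-1 into k folds."""
    return np.array_split(rng.permutation(n), k)


def cv_error(x, y, degree, k, rng):
    """Inner loop: average validation RMSE over k folds, as in (8.39)."""
    folds = kfold_indices(len(x), k, rng)
    errs = []
    for i in range(k):
        va = folds[i]
        tr = np.concatenate([folds[j] for j in range(k) if j != i])
        errs.append(rmse_of_fit(x[tr], y[tr], x[va], y[va], degree))
    return float(np.mean(errs))


def nested_cv(x, y, degrees, k_outer=5, k_inner=4, seed=0):
    """Outer loop: estimate test performance of the model chosen by the inner loop."""
    rng = np.random.default_rng(seed)
    outer = kfold_indices(len(x), k_outer, rng)
    chosen, test_errors = [], []
    for i in range(k_outer):
        te = outer[i]
        dev = np.concatenate([outer[j] for j in range(k_outer) if j != i])
        inner = [cv_error(x[dev], y[dev], d, k_inner, rng) for d in degrees]
        best = degrees[int(np.argmin(inner))]
        chosen.append(best)
        test_errors.append(rmse_of_fit(x[dev], y[dev], x[te], y[te], best))
    mean_err = float(np.mean(test_errors))
    std_err = float(np.std(test_errors, ddof=1) / np.sqrt(k_outer))
    return chosen, mean_err, std_err


# Synthetic data (an assumption for illustration): a noisy quadratic.
rng = np.random.default_rng(1)
x = rng.uniform(-3.0, 3.0, size=60)
y = 0.5 * x**2 - x + rng.normal(0.0, 0.3, size=60)
degrees = [1, 2, 3, 6]
chosen, mean_err, std_err = nested_cv(x, y, degrees)
```

The returned standard error is the kind of higher-order statistic mentioned above: it indicates how uncertain the mean test error is across the outer folds.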


8.6.2 Bayesian Model Selection 


There are many approaches to model selection, some of which are covered 
in this section. Generally, they all attempt to trade off model complexity 
and data fit. We assume that simpler models are less prone to overfitting 
than complex models, and hence the objective of model selection is to find 
the simplest model that explains the data reasonably well. This concept is 
also known as Occam’s razor. 


Remark. If we treat model selection as a hypothesis testing problem, we
are looking for the simplest hypothesis that is consistent with the data
(Murphy, 2012). ♢

One may consider placing a prior on models that favors simpler models. 
However, it is not necessary to do this: An “automatic Occam’s Razor” is 
quantitatively embodied in the application of Bayesian probability (Smith 
and Spiegelhalter, 1980; Jefferys and Berger, 1992; MacKay, 1992). Fig- 
ure 8.14, adapted from MacKay (2003), gives us the basic intuition why 
complex and very expressive models may turn out to be a less probable 
choice for modeling a given dataset D. Let us think of the horizontal axis
representing the space of all possible datasets D. If we are interested in
the posterior probability $p(M_i \mid D)$ of model $M_i$ given the data D, we can
employ Bayes’ theorem. Assuming a uniform prior $p(M)$ over all models,
Bayes’ theorem rewards models in proportion to how much they predicted
the data that occurred. This prediction of the data given model
$M_i$, $p(D \mid M_i)$, is called the evidence for $M_i$. A simple model $M_1$ can only
predict a small number of datasets, which is shown by $p(D \mid M_1)$; a more
powerful model $M_2$ that has, e.g., more free parameters than $M_1$, is able
to predict a greater variety of datasets. This means, however, that $M_2$
does not predict the datasets in region C as well as $M_1$. Suppose that
equal prior probabilities have been assigned to the two models. Then, if
the dataset falls into region C, the less powerful model $M_1$ is the more
probable model.

Earlier in this chapter, we argued that models need to be able to explain 
the data, i.e., there should be a way to generate data from a given model. 
Furthermore, if the model has been appropriately learned from the data, 
then we expect that the generated data should be similar to the empirical 
data. For this, it is helpful to phrase model selection as a hierarchical 
inference problem, which allows us to compute the posterior distribution 
over models. 

Let us consider a finite number of models $M = \{M_1, \dots, M_K\}$, where
each model $M_k$ possesses parameters $\theta_k$. In Bayesian model selection, we
place a prior $p(M)$ on the set of models. The corresponding generative
process that allows us to generate data from this model is


$$M_k \sim p(M) \tag{8.40}$$
$$\theta_k \sim p(\theta \mid M_k) \tag{8.41}$$
$$D \sim p(D \mid \theta_k) \tag{8.42}$$


and illustrated in Figure 8.15. Given a training set D, we apply Bayes’ 
theorem and compute the posterior distribution over models as 


$$p(M_k \mid D) \propto p(M_k)\, p(D \mid M_k). \tag{8.43}$$


Note that this posterior no longer depends on the model parameters $\theta_k$
because they have been integrated out in the Bayesian setting since


$$p(D \mid M_k) = \int p(D \mid \theta_k)\, p(\theta_k \mid M_k)\, \mathrm{d}\theta_k, \tag{8.44}$$

where $p(\theta_k \mid M_k)$ is the prior distribution of the model parameters $\theta_k$ of
model $M_k$. The term (8.44) is referred to as the model evidence or marginal
likelihood. From the posterior in (8.43), we determine the MAP estimate 


$$M^* = \arg\max_{M_k} p(M_k \mid D). \tag{8.45}$$


With a uniform prior $p(M_k) = \frac{1}{K}$, which gives every model equal (prior)
probability, determining the MAP estimate over models amounts to pick- 
ing the model that maximizes the model evidence (8.44). 


Remark (Likelihood and Marginal Likelihood). There are some important 
differences between a likelihood and a marginal likelihood (evidence): 
While the likelihood is prone to overfitting, the marginal likelihood is typ- 
ically not as the model parameters have been marginalized out (i.e., we 
no longer have to fit the parameters). Furthermore, the marginal likeli- 
hood automatically embodies a trade-off between model complexity and 
data fit (Occam’s razor). 
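To make (8.43) and (8.44) concrete, here is a small sketch for a case where the evidence is available in closed form. The setup is an assumption chosen for illustration and does not appear in the text: the data are an ordered sequence of coin flips, $M_1$ is a fair coin with no free parameter, and $M_2$ has an unknown bias $\theta$ with a Beta(a, b) prior, so that its evidence is a ratio of Beta functions.

```python
import math


def log_evidence_fair_coin(heads, tails):
    """M_1: theta is fixed to 0.5, so the evidence is 0.5^(heads + tails)."""
    return (heads + tails) * math.log(0.5)


def log_evidence_beta_coin(heads, tails, a=1.0, b=1.0):
    """M_2: theta ~ Beta(a, b); integrating theta out gives B(a + h, b + t) / B(a, b)."""
    def log_beta(p, q):
        return math.lgamma(p) + math.lgamma(q) - math.lgamma(p + q)
    return log_beta(a + heads, b + tails) - log_beta(a, b)


def model_posterior(log_evidences, priors):
    """Posterior over models, p(M_k | D) proportional to p(M_k) p(D | M_k)."""
    log_joint = [math.log(p) + le for p, le in zip(priors, log_evidences)]
    m = max(log_joint)                       # subtract the max for numerical stability
    weights = [math.exp(v - m) for v in log_joint]
    z = sum(weights)
    return [w / z for w in weights]


uniform_prior = [0.5, 0.5]

# Balanced data: 5 heads, 5 tails.
balanced = model_posterior(
    [log_evidence_fair_coin(5, 5), log_evidence_beta_coin(5, 5)], uniform_prior)

# Strongly skewed data: 10 heads, 0 tails.
skewed = model_posterior(
    [log_evidence_fair_coin(10, 0), log_evidence_beta_coin(10, 0)], uniform_prior)
```

With balanced data the simpler model receives the higher posterior probability, whereas the skewed data favor the more flexible model. This is the automatic Occam's razor at work: the flexible model spreads its evidence over many possible datasets.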


Draft (2021-01-14) of “Mathematics for Machine Learning”. Feedback: https://mml-book.com.




8.6.3 Bayes Factors for Model Comparison 


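Consider the problem of comparing two probabilistic models $M_1$, $M_2$,
given a dataset D. If we compute the posteriors $p(M_1 \mid D)$ and $p(M_2 \mid D)$,
we can compute the ratio of the posteriors


$$\underbrace{\frac{p(M_1 \mid D)}{p(M_2 \mid D)}}_{\text{posterior odds}} = \frac{\dfrac{p(D \mid M_1)\, p(M_1)}{p(D)}}{\dfrac{p(D \mid M_2)\, p(M_2)}{p(D)}} = \underbrace{\frac{p(M_1)}{p(M_2)}}_{\text{prior odds}} \; \underbrace{\frac{p(D \mid M_1)}{p(D \mid M_2)}}_{\text{Bayes factor}}. \tag{8.46}$$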


The ratio of the posteriors is also called the posterior odds. The first fraction
on the right-hand side of (8.46), the prior odds, measures how much
our prior (initial) beliefs favor $M_1$ over $M_2$. The ratio of the marginal likelihoods
(second fraction on the right-hand side) is called the Bayes factor
and measures how well the data D is predicted by $M_1$ compared to $M_2$.


Remark. The Jeffreys-Lindley paradox states that the “Bayes factor always 
favors the simpler model since the probability of the data under a complex 
model with a diffuse prior will be very small” (Murphy, 2012). Here, a 
diffuse prior refers to a prior that does not favor specific models, i.e., 
many models are a priori plausible under this prior. ♢


If we choose a uniform prior over models, the prior odds term in (8.46)
is 1, i.e., the posterior odds is the ratio of the marginal likelihoods (Bayes
factor)

$$\frac{p(D \mid M_1)}{p(D \mid M_2)}. \tag{8.47}$$

If the Bayes factor is greater than 1, we choose model $M_1$, otherwise
model $M_2$. In a similar way to frequentist statistics, there are guidelines
on how large the ratio should be before the result is considered “significant”
(Jeffreys, 1961).
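The decomposition in (8.46) is easy to evaluate once the log-evidences are available. The snippet below is only a sketch: the two log-evidence values are made-up numbers for illustration, and the function names are hypothetical. Working with logarithms avoids forming very small evidences explicitly.

```python
import math


def bayes_factor(log_ev_1, log_ev_2):
    """Ratio of marginal likelihoods p(D | M_1) / p(D | M_2), from log-evidences."""
    return math.exp(log_ev_1 - log_ev_2)


def posterior_odds(log_ev_1, log_ev_2, prior_1=0.5, prior_2=0.5):
    """Posterior odds = prior odds * Bayes factor, as in (8.46)."""
    return (prior_1 / prior_2) * bayes_factor(log_ev_1, log_ev_2)


# Made-up log-evidences, purely for illustration.
log_ev_1, log_ev_2 = -12.3, -14.1

bf = bayes_factor(log_ev_1, log_ev_2)
odds_uniform = posterior_odds(log_ev_1, log_ev_2)            # equals the Bayes factor
odds_skeptical = posterior_odds(log_ev_1, log_ev_2, 0.1, 0.9)  # prior favors M_2
chosen_model = "M_1" if odds_uniform > 1.0 else "M_2"
```

With a uniform prior the posterior odds coincide with the Bayes factor (8.47); a prior that strongly favors $M_2$ can reverse the decision even when the Bayes factor exceeds 1.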


Remark (Computing the Marginal Likelihood). The marginal likelihood 
plays an important role in model selection: We need to compute Bayes 
factors (8.46) and posterior distributions over models (8.43). 
Unfortunately, computing the marginal likelihood requires us to solve 
an integral (8.44). This integration is generally analytically intractable, 
and we will have to resort to approximation techniques, e.g., numerical 
integration (Stoer and Burlirsch, 2002), stochastic approximations using
Monte Carlo (Murphy, 2012), or Bayesian Monte Carlo techniques (O’Hagan,
1991; Rasmussen and Ghahramani, 2003).

However, there are special cases in which we can solve it. In Section 6.6.1,
we discussed conjugate models. If we choose a conjugate parameter prior
$p(\theta)$, we can compute the marginal likelihood in closed form. In Chapter 9,
we will do exactly this in the context of linear regression. ♢
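As a small illustration of the two routes mentioned in the remark, the sketch below estimates the integral (8.44) by simple Monte Carlo (sampling parameters from the prior and averaging the likelihood) and compares it with the closed form of a conjugate model. The Bernoulli likelihood with a uniform Beta(1, 1) prior is an assumption made for this example; simple Monte Carlo from the prior is only one of several possible stochastic approximations and can be very inefficient in higher dimensions.

```python
import math

import numpy as np


def exact_evidence(heads, tails):
    """Closed form for a uniform prior on theta: the Beta function B(h + 1, t + 1)."""
    return math.exp(
        math.lgamma(heads + 1) + math.lgamma(tails + 1) - math.lgamma(heads + tails + 2)
    )


def mc_evidence(heads, tails, num_samples=200_000, seed=0):
    """Monte Carlo estimate of the integral of p(D | theta) p(theta) over theta."""
    rng = np.random.default_rng(seed)
    theta = rng.uniform(0.0, 1.0, size=num_samples)    # samples from the prior
    likelihood = theta**heads * (1.0 - theta)**tails   # p(D | theta) for an ordered sequence
    return float(likelihood.mean())


heads, tails = 6, 4
evidence_exact = exact_evidence(heads, tails)
evidence_mc = mc_evidence(heads, tails)
relative_error = abs(evidence_mc - evidence_exact) / evidence_exact
```

For this one-dimensional problem the Monte Carlo estimate agrees with the closed form to within a small relative error.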


We have seen a brief introduction to the basic concepts of machine 
learning in this chapter. For the rest of this part of the book we will see
how the three different flavors of learning in Sections 8.2, 8.3, and 8.4 are
applied to the four pillars of machine learning (regression, dimensionality 
reduction, density estimation, and classification). 


8.6.4 Further Reading 


We mentioned at the start of the section that there are high-level modeling 
choices that influence the performance of the model. Examples include the 
following: 


= The degree of a polynomial in a regression setting 

= The number of components in a mixture model 

= The network architecture of a (deep) neural network 

= The type of kernel in a support vector machine 

= The dimensionality of the latent space in PCA 

= The learning rate (schedule) in an optimization algorithm 


Rasmussen and Ghahramani (2001) showed that the automatic Occam’s 
razor does not necessarily penalize the number of parameters in a model, 
but it is active in terms of the complexity of functions. They also showed 
that the automatic Occam’s razor also holds for Bayesian nonparametric 
models with many parameters, e.g., Gaussian processes. 

If we focus on the maximum likelihood estimate, there exist a number of 
heuristics for model selection that discourage overfitting. They are called 
information criteria, and we choose the model with the largest value. The 
Akaike information criterion (AIC) (Akaike, 1974) 


$$\log p(x \mid \theta) - M \tag{8.48}$$


corrects for the bias of the maximum likelihood estimator by addition of 
a penalty term to compensate for the overfitting of more complex models 
with lots of parameters. Here, M is the number of model parameters. The 
AIC estimates the relative information lost by a given model. 

The Bayesian information criterion (BIC) (Schwarz, 1978) 


$$\log p(x) = \log \int p(x \mid \theta)\, p(\theta)\, \mathrm{d}\theta \approx \log p(x \mid \theta) - \frac{1}{2} M \log N \tag{8.49}$$


can be used for exponential family distributions. Here, N is the number 
of data points and M is the number of parameters. BIC penalizes model 
complexity more heavily than AIC. 
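The two criteria (8.48) and (8.49) can be compared on a toy problem. The sketch below rests on assumptions that are not part of the text: polynomial regression with Gaussian noise on synthetic data, the maximized log-likelihood computed with the noise variance set to its maximum likelihood value, and M counted as the number of polynomial coefficients plus one for the noise variance. Following the convention above, larger values are better.

```python
import numpy as np


def max_gaussian_log_likelihood(y, y_hat):
    """Gaussian log-likelihood at the ML noise variance sigma^2 = mean squared residual."""
    n = len(y)
    sigma2 = np.mean((y - y_hat) ** 2)
    return float(-0.5 * n * (np.log(2.0 * np.pi * sigma2) + 1.0))


def aic(log_lik, num_params):
    """AIC in the form of (8.48): log-likelihood minus the number of parameters."""
    return log_lik - num_params


def bic(log_lik, num_params, num_data):
    """BIC in the form of (8.49): the penalty grows with log N."""
    return log_lik - 0.5 * num_params * np.log(num_data)


# Synthetic data (an assumption for illustration): a noisy quadratic.
rng = np.random.default_rng(0)
N = 50
x = rng.uniform(-1.0, 1.0, size=N)
y = 1.0 + 2.0 * x - 3.0 * x**2 + rng.normal(0.0, 0.1, size=N)

degrees = list(range(7))
aic_scores, bic_scores = [], []
for d in degrees:
    coeffs = np.polyfit(x, y, d)
    log_lik = max_gaussian_log_likelihood(y, np.polyval(coeffs, x))
    M = d + 2  # d + 1 coefficients plus the noise variance
    aic_scores.append(aic(log_lik, M))
    bic_scores.append(bic(log_lik, M, N))

best_aic = degrees[int(np.argmax(aic_scores))]
best_bic = degrees[int(np.argmax(bic_scores))]
```

Because $\frac{1}{2}\log N > 1$ for N = 50, every BIC score lies below the corresponding AIC score, and the degree selected by BIC is never larger than the one selected by AIC.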


Linear Regression


In the following, we will apply the mathematical concepts from Chap- 
ters 2, 5, 6, and 7 to solve linear regression (curve fitting) problems. In 
regression, we aim to find a function f that maps inputs $x \in \mathbb{R}^D$ to corresponding
function values $f(x) \in \mathbb{R}$. We assume we are given a set of training
inputs $x_n$ and corresponding noisy observations $y_n = f(x_n) + \epsilon$, where
$\epsilon$ is an i.i.d. random variable that describes measurement/observation
noise and potentially unmodeled processes (which we will not consider 
further in this chapter). Throughout this chapter, we assume zero-mean 
Gaussian noise. Our task is to find a function that not only models the 
training data, but generalizes well to predicting function values at input 
locations that are not part of the training data (see Chapter 8). An il- 
lustration of such a regression problem is given in Figure 9.1. A typical 
regression setting is given in Figure 9.1(a): For some input values $x_n$, we
observe (noisy) function values $y_n = f(x_n) + \epsilon$. The task is to infer the
function f that generated the data and generalizes well to function values 
at new input locations. A possible solution is given in Figure 9.1(b), where 
we also show three distributions centered at the function values f(x) that 
represent the noise in the data. 

Regression is a fundamental problem in machine learning, and regression
problems appear in a diverse range of research areas and applications,
including time-series analysis (e.g., system identification), control
and robotics (e.g., reinforcement learning, forward/inverse model learn- 
ing), optimization (e.g., line searches, global optimization), and deep- 
learning applications (e.g., computer games, speech-to-text translation, 
image recognition, automatic video annotation). Regression is also a key 
ingredient of classification algorithms. Finding a regression function re- 
quires solving a variety of problems, including the following: 


= Choice of the model (type) and the parametrization of the regres- 
sion function. Given a dataset, what function classes (e.g., polynomi- 
als) are good candidates for modeling the data, and what particular 
parametrization (e.g., degree of the polynomial) should we choose? 
Model selection, as discussed in Section 8.6, allows us to compare var- 
ious models to find the simplest model that explains the training data 
reasonably well. 

= Finding good parameters. Having chosen a model of the regression 
function, how do we find good model parameters? Here, we will need to 
look at different loss/objective functions (they determine what a “good” 
fit is) and optimization algorithms that allow us to minimize this loss. 

= Overfitting and model selection. Overfitting is a problem when the 
regression function fits the training data “too well” but does not gen- 
eralize to unseen test data. Overfitting typically occurs if the underly- 
ing model (or its parametrization) is overly flexible and expressive; see 
Section 8.6. We will look at the underlying reasons and discuss ways to 
mitigate the effect of overfitting in the context of linear regression. 

= Relationship between loss functions and parameter priors. Loss func- 
tions (optimization objectives) are often motivated and induced by prob- 
abilistic models. We will look at the connection between loss functions 
and the underlying prior assumptions that induce these losses. 

= Uncertainty modeling. In any practical setting, we have access to only 
a finite, potentially large, amount of (training) data for selecting the 
model class and the corresponding parameters. Given that this finite 
amount of training data does not cover all possible scenarios, we may 
want to describe the remaining parameter uncertainty to obtain a mea- 
sure of confidence of the model’s prediction at test time; the smaller the 
training set, the more important uncertainty modeling. Consistent mod- 
eling of uncertainty equips model predictions with confidence bounds. 


In the following, we will be using the mathematical tools from Chap- 
ters 3, 5, 6 and 7 to solve linear regression problems. We will discuss 
maximum likelihood and maximum a posteriori (MAP) estimation to find 
optimal model parameters. Using these parameter estimates, we will have 
a brief look at generalization errors and overfitting. Toward the end of 
this chapter, we will discuss Bayesian linear regression, which allows us to 
reason about model parameters at a higher level, thereby removing some 
of the problems encountered in maximum likelihood and MAP estimation.


9.1 Problem Formulation


Because of the presence of observation noise, we will adopt a probabilis- 
tic approach and explicitly model the noise using a likelihood function. 
More specifically, throughout this chapter, we consider a regression prob- 
lem with the likelihood function 


$$p(y \mid x) = \mathcal{N}(y \mid f(x), \sigma^2). \tag{9.1}$$


Here, $x \in \mathbb{R}^D$ are inputs and $y \in \mathbb{R}$ are noisy function values (targets).
With (9.1), the functional relationship between x and y is given as


$$y = f(x) + \epsilon, \tag{9.2}$$


where $\epsilon \sim \mathcal{N}(0, \sigma^2)$ is independent, identically distributed (i.i.d.) Gaussian
measurement noise with mean 0 and variance $\sigma^2$. Our objective is
to find a function that is close (similar) to the unknown function f that 
generated the data and that generalizes well. 

In this chapter, we focus on parametric models, i.e., we choose a parametrized
function and find parameters $\theta$ that “work well” for modeling the
data. For the time being, we assume that the noise variance $\sigma^2$ is known
and focus on learning the model parameters $\theta$. In linear regression, we
consider the special case that the parameters $\theta$ appear linearly in our
model. An example of linear regression is given by


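$$p(y \mid x, \theta) = \mathcal{N}(y \mid x^\top \theta, \sigma^2) \tag{9.3}$$
$$\iff y = x^\top \theta + \epsilon, \quad \epsilon \sim \mathcal{N}(0, \sigma^2), \tag{9.4}$$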


where $\theta \in \mathbb{R}^D$ are the parameters we seek. The class of functions described
by (9.4) are straight lines that pass through the origin. In (9.4),
we chose a parametrization $f(x) = x^\top \theta$.

The likelihood in (9.3) is the probability density function of y evaluated
at $x^\top \theta$. Note that the only source of uncertainty originates from the
observation noise (as x and $\theta$ are assumed known in (9.3)). Without observation
noise, the relationship between x and y would be deterministic
and (9.3) would be a Dirac delta.


Example 9.1 

For $x, \theta \in \mathbb{R}$ the linear regression model in (9.4) describes straight lines
(linear functions), and the parameter $\theta$ is the slope of the line. Figure 9.2(a)
shows some example functions for different values of $\theta$.


The linear regression model in (9.3)–(9.4) is not only linear in the parameters,
but also linear in the inputs x. Figure 9.2(a) shows examples
of such functions. We will see later that $y = \phi^\top(x)\theta$ for nonlinear transformations
$\phi$ is also a linear regression model because “linear regression”
refers to models that are “linear in the parameters”, i.e., models that describe
a function by a linear combination of input features. Here, a “feature”
is a representation $\phi(x)$ of the inputs x.

[Figure 9.2: (a) Example functions (straight lines) that can be described using the linear model in (9.4). (b) Training set. (c) Maximum likelihood estimate.]

In the following, we will discuss in more detail how to find good parameters
$\theta$ and how to evaluate whether a parameter set “works well”.
For the time being, we assume that the noise variance $\sigma^2$ is known.


9.2 Parameter Estimation 


Consider the linear regression setting (9.4) and assume we are given a
training set $\mathcal{D} := \{(x_1, y_1), \dots, (x_N, y_N)\}$ consisting of N inputs $x_n \in \mathbb{R}^D$
and corresponding observations/targets $y_n \in \mathbb{R}$, $n = 1, \dots, N$. The
corresponding graphical model is given in Figure 9.3. Note that $y_i$ and $y_j$
are conditionally independent given their respective inputs $x_i$, $x_j$ so that
the likelihood factorizes according to


$$p(\mathcal{Y} \mid \mathcal{X}, \theta) = p(y_1, \dots, y_N \mid x_1, \dots, x_N, \theta) \tag{9.5a}$$
$$= \prod_{n=1}^{N} p(y_n \mid x_n, \theta) = \prod_{n=1}^{N} \mathcal{N}(y_n \mid x_n^\top \theta, \sigma^2), \tag{9.5b}$$


where we defined $\mathcal{X} := \{x_1, \dots, x_N\}$ and $\mathcal{Y} := \{y_1, \dots, y_N\}$ as the sets
of training inputs and corresponding targets, respectively. The likelihood
and the factors $p(y_n \mid x_n, \theta)$ are Gaussian due to the noise distribution;
see (9.3).

In the following, we will discuss how to find optimal parameters $\theta^* \in \mathbb{R}^D$
for the linear regression model (9.4). Once the parameters $\theta^*$ are
found, we can predict function values by using this parameter estimate
in (9.4) so that at an arbitrary test input $x_*$ the distribution of the corresponding
target $y_*$ is


$$p(y_* \mid x_*, \theta^*) = \mathcal{N}(y_* \mid x_*^\top \theta^*, \sigma^2). \tag{9.6}$$


In the following, we will have a look at parameter estimation by maxi- 
mizing the likelihood, a topic that we already covered to some degree in 
Section 8.3.


9.2.1 Maximum Likelihood Estimation


A widely used approach to finding the desired parameters $\theta_{\mathrm{ML}}$ is maximum
likelihood estimation, where we find parameters $\theta_{\mathrm{ML}}$ that maximize the
likelihood (9.5b). Intuitively, maximizing the likelihood means maximizing
the predictive distribution of the training data given the model parameters.
We obtain the maximum likelihood parameters as


$$\theta_{\mathrm{ML}} = \arg\max_{\theta}\, p(\mathcal{Y} \mid \mathcal{X}, \theta). \tag{9.7}$$

Remark. The likelihood $p(y \mid x, \theta)$ is not a probability distribution in $\theta$: It
is simply a function of the parameters $\theta$ but does not integrate to 1 (i.e.,
it is unnormalized), and may not even be integrable with respect to $\theta$.
However, the likelihood in (9.7) is a normalized probability distribution
in y. ♢

To find the desired parameters $\theta_{\mathrm{ML}}$ that maximize the likelihood, we
typically perform gradient ascent (or gradient descent on the negative 
likelihood). In the case of linear regression we consider here, however, 
a closed-form solution exists, which makes iterative gradient descent un- 
necessary. In practice, instead of maximizing the likelihood directly, we 
apply the log-transformation to the likelihood function and minimize the 
negative log-likelihood. 


Remark (Log-Transformation). Since the likelihood (9.5b) is a product of 
N Gaussian distributions, the log-transformation is useful since (a) it does 
not suffer from numerical underflow, and (b) the differentiation rules will 
turn out simpler. More specifically, numerical underflow will be a problem
when we multiply N probabilities, where N is the number of data
points, since we cannot represent very small numbers, such as $10^{-256}$.
Furthermore, the log-transform will turn the product into a sum of log-probabilities
such that the corresponding gradient is a sum of individual
gradients, instead of a repeated application of the product rule (5.46) to
compute the gradient of a product of N terms. ♢
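The underflow issue described in the remark is easy to reproduce. The following sketch assumes double-precision floating point and a synthetic set of N = 2000 residuals drawn from a standard normal distribution (both are assumptions for illustration). It multiplies the N Gaussian densities directly and, alternatively, sums their logarithms.

```python
import numpy as np


def gaussian_pdf(y, mean, sigma):
    """Density of N(y | mean, sigma^2)."""
    return np.exp(-0.5 * ((y - mean) / sigma) ** 2) / (sigma * np.sqrt(2.0 * np.pi))


rng = np.random.default_rng(0)
N = 2000
y = rng.normal(0.0, 1.0, size=N)

densities = gaussian_pdf(y, mean=0.0, sigma=1.0)

# Each factor is at most about 0.4, so the product of 2000 factors is far
# below the smallest positive double and underflows to exactly 0.0.
likelihood = float(np.prod(densities))

# The sum of log-densities is a moderately sized negative number.
log_likelihood = float(np.sum(np.log(densities)))
```

The product is numerically zero and carries no information for optimization, whereas the log-likelihood remains finite and can be differentiated term by term.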


To find the optimal parameters $\theta_{\mathrm{ML}}$ of our linear regression problem,
we minimize the negative log-likelihood


$$-\log p(\mathcal{Y} \mid \mathcal{X}, \theta) = -\log \prod_{n=1}^{N} p(y_n \mid x_n, \theta) = -\sum_{n=1}^{N} \log p(y_n \mid x_n, \theta), \tag{9.8}$$


where we exploited that the likelihood (9.5b) factorizes over the number
of data points due to our independence assumption on the training set.

In the linear regression model (9.4), the likelihood is Gaussian (due to
the Gaussian additive noise term), such that we arrive at

$$\log p(y_n \mid x_n, \theta) = -\frac{1}{2\sigma^2} (y_n - x_n^\top \theta)^2 + \text{const}, \tag{9.9}$$

where the constant includes all terms independent of $\theta$. Using (9.9) in the
negative log-likelihood (9.8), we obtain (ignoring the constant terms)


$$\mathcal{L}(\theta) := \frac{1}{2\sigma^2} \sum_{n=1}^{N} (y_n - x_n^\top \theta)^2 \tag{9.10a}$$
$$= \frac{1}{2\sigma^2} (y - X\theta)^\top (y - X\theta) = \frac{1}{2\sigma^2} \|y - X\theta\|^2, \tag{9.10b}$$

where we define the design matrix $X := [x_1, \dots, x_N]^\top \in \mathbb{R}^{N \times D}$ as the
collection of training inputs and $y := [y_1, \dots, y_N]^\top \in \mathbb{R}^N$ as a vector that
collects all training targets. Note that the nth row in the design matrix X
corresponds to the training input $x_n$. In (9.10b), we used the fact that the
sum of squared errors between the observations $y_n$ and the corresponding
model prediction $x_n^\top \theta$ equals the squared distance between y and $X\theta$.

With (9.10b), we have now a concrete form of the negative log-likelihood
function we need to optimize. We immediately see that (9.10b) is quadratic
in $\theta$. This means that we can find a unique global solution $\theta_{\mathrm{ML}}$ for minimizing
the negative log-likelihood $\mathcal{L}$. We can find the global optimum by
computing the gradient of $\mathcal{L}$, setting it to 0 and solving for $\theta$.

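Using the results from Chapter 5, we compute the gradient of $\mathcal{L}$ with
respect to the parameters as


$$\frac{\mathrm{d}\mathcal{L}}{\mathrm{d}\theta} = \frac{\mathrm{d}}{\mathrm{d}\theta} \left( \frac{1}{2\sigma^2} (y - X\theta)^\top (y - X\theta) \right) \tag{9.11a}$$
$$= \frac{1}{2\sigma^2} \frac{\mathrm{d}}{\mathrm{d}\theta} \left( y^\top y - 2 y^\top X \theta + \theta^\top X^\top X \theta \right) \tag{9.11b}$$
$$= \frac{1}{\sigma^2} \left( -y^\top X + \theta^\top X^\top X \right) \in \mathbb{R}^{1 \times D}. \tag{9.11c}$$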

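The maximum likelihood estimator $\theta_{\mathrm{ML}}$ solves $\frac{\mathrm{d}\mathcal{L}}{\mathrm{d}\theta} = 0^\top$ (necessary optimality
condition) and we obtain


$$\frac{\mathrm{d}\mathcal{L}}{\mathrm{d}\theta} = 0^\top \iff \theta_{\mathrm{ML}}^\top X^\top X = y^\top X \tag{9.12a}$$
$$\iff \theta_{\mathrm{ML}}^\top = y^\top X (X^\top X)^{-1} \tag{9.12b}$$
$$\iff \theta_{\mathrm{ML}} = (X^\top X)^{-1} X^\top y. \tag{9.12c}$$


We could right-multiply the first equation by $(X^\top X)^{-1}$ because $X^\top X$ is
positive definite if rk(X) = D, where rk(X) denotes the rank of X.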
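The closed-form estimate (9.12c) can be sketched in a few lines of NumPy. The synthetic data, the true parameter vector, and the function names below are assumptions for illustration. Rather than forming the inverse explicitly, the sketch solves the linear system $X^\top X \theta = X^\top y$, and it also evaluates the gradient (9.11c) at the solution as a sanity check.

```python
import numpy as np


def fit_ml(X, y):
    """Maximum likelihood estimate: solve X^T X theta = X^T y (assumes rk(X) = D)."""
    return np.linalg.solve(X.T @ X, X.T @ y)


def neg_log_lik_grad(X, y, theta, sigma2):
    """Gradient (9.11c) of the negative log-likelihood, returned as a 1-D array."""
    return (-(y @ X) + theta @ (X.T @ X)) / sigma2


# Synthetic data (an assumption for illustration): N = 100 inputs in R^3.
rng = np.random.default_rng(0)
N, D = 100, 3
X = rng.normal(size=(N, D))
theta_true = np.array([1.0, -2.0, 0.5])
sigma = 0.1
y = X @ theta_true + rng.normal(0.0, sigma, size=N)

theta_ml = fit_ml(X, y)

# Cross-check against NumPy's least-squares routine.
theta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

# The gradient vanishes (up to round-off) at the maximum likelihood estimate.
grad_at_ml = neg_log_lik_grad(X, y, theta_ml, sigma2=sigma**2)
```

Note that the estimate does not depend on the noise variance; $\sigma^2$ only scales the gradient.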


Remark. Setting the gradient to $0^\top$ is a necessary and sufficient condition,
and we obtain a global minimum since the Hessian $\nabla_\theta^2 \mathcal{L}(\theta) = \frac{1}{\sigma^2} X^\top X \in \mathbb{R}^{D \times D}$
is positive definite (when rk(X) = D). ♢

Remark. The maximum likelihood solution in (9.12c) requires us to solve
a system of linear equations of the form $A\theta = b$ with $A = X^\top X$ and
$b = X^\top y$. ♢


Example 9.2 (Fitting Lines)

Let us have a look at Figure 9.2, where we aim to fit a straight line $f(x) = \theta x$,
where $\theta$ is an unknown slope, to a dataset using maximum likelihood
estimation. Examples of functions in this model class (straight lines) are
shown in Figure 9.2(a). For the dataset shown in Figure 9.2(b), we find
the maximum likelihood estimate of the slope parameter $\theta$ using (9.12c)
and obtain the maximum likelihood linear function in Figure 9.2(c). 


Maximum Likelihood Estimation with Features 


So far, we considered the linear regression setting described in (9.4), 
which allowed us to fit straight lines to data using maximum likelihood 
estimation. However, straight lines are not sufficiently expressive when it 
comes to fitting more interesting data. Fortunately, linear regression offers 
us a way to fit nonlinear functions within the linear regression framework: 
Since “linear regression” only refers to “linear in the parameters”, we can 
perform an arbitrary nonlinear transformation $\phi(x)$ of the inputs x and
then linearly combine the components of this transformation. The corresponding
linear regression model is


$$p(y \mid x, \theta) = \mathcal{N}(y \mid \phi^\top(x)\theta, \sigma^2)$$
$$\iff y = \phi^\top(x)\theta + \epsilon = \sum_{k=0}^{K-1} \theta_k \phi_k(x) + \epsilon, \tag{9.13}$$


where $\phi: \mathbb{R}^D \to \mathbb{R}^K$ is a (nonlinear) transformation of the inputs x and
$\phi_k: \mathbb{R}^D \to \mathbb{R}$ is the kth component of the feature vector $\phi$. Note that the
model parameters $\theta$ still appear only linearly.


Example 9.3 (Polynomial Regression) 
We are concerned with a regression problem $y = \phi^\top(x)\theta + \epsilon$, where $x \in \mathbb{R}$
and $\theta \in \mathbb{R}^K$. A transformation that is often used in this context is


$$\phi(x) = \begin{bmatrix} \phi_0(x) \\ \phi_1(x) \\ \vdots \\ \phi_{K-1}(x) \end{bmatrix} = \begin{bmatrix} 1 \\ x \\ x^2 \\ x^3 \\ \vdots \\ x^{K-1} \end{bmatrix} \in \mathbb{R}^K. \tag{9.14}$$


This means that we “lift” the original one-dimensional input space into
a K-dimensional feature space consisting of all monomials $x^k$ for $k = 0, \dots, K-1$.
With these features, we can model polynomials of degree
$\leq K-1$ within the framework of linear regression: A polynomial of degree
$K-1$ is

$$f(x) = \sum_{k=0}^{K-1} \theta_k x^k = \phi^\top(x)\theta, \tag{9.15}$$

where $\phi$ is defined in (9.14) and $\theta = [\theta_0, \dots, \theta_{K-1}]^\top \in \mathbb{R}^K$ contains the
(linear) parameters $\theta_k$.


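Let us now have a look at maximum likelihood estimation of the parameters
$\theta$ in the linear regression model (9.13). We consider training inputs
$x_n \in \mathbb{R}^D$ and targets $y_n \in \mathbb{R}$, $n = 1, \dots, N$, and define the feature matrix
(design matrix) as


$$\Phi := \begin{bmatrix} \phi^\top(x_1) \\ \vdots \\ \phi^\top(x_N) \end{bmatrix} = \begin{bmatrix} \phi_0(x_1) & \cdots & \phi_{K-1}(x_1) \\ \phi_0(x_2) & \cdots & \phi_{K-1}(x_2) \\ \vdots & & \vdots \\ \phi_0(x_N) & \cdots & \phi_{K-1}(x_N) \end{bmatrix} \in \mathbb{R}^{N \times K}, \tag{9.16}$$

where $\Phi_{ij} = \phi_j(x_i)$ and $\phi_j: \mathbb{R}^D \to \mathbb{R}$.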


Example 9.4 (Feature Matrix for Second-order Polynomials) 
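For a second-order polynomial and N training points $x_n \in \mathbb{R}$, $n = 1, \dots, N$,
the feature matrix is


$$\Phi = \begin{bmatrix} 1 & x_1 & x_1^2 \\ 1 & x_2 & x_2^2 \\ \vdots & \vdots & \vdots \\ 1 & x_N & x_N^2 \end{bmatrix}. \tag{9.17}$$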


With the feature matrix $\Phi$ defined in (9.16), the negative log-likelihood
for the linear regression model (9.13) can be written as


$$-\log p(\mathcal{Y} \mid \mathcal{X}, \theta) = \frac{1}{2\sigma^2} (y - \Phi\theta)^\top (y - \Phi\theta) + \text{const}. \tag{9.18}$$


Comparing (9.18) with the negative log-likelihood in (9.10b) for the “feature-free”
model, we immediately see we just need to replace X with $\Phi$.
Since both X and $\Phi$ are independent of the parameters $\theta$ that we wish to
optimize, we arrive immediately at the maximum likelihood estimate


$$\theta_{\mathrm{ML}} = (\Phi^\top \Phi)^{-1} \Phi^\top y \tag{9.19}$$

for the linear regression problem with nonlinear features defined in (9.13).


Remark. When we were working without features, we required $X^\top X$ to
be invertible, which is the case when rk(X) = D, i.e., the columns of X
are linearly independent. In (9.19), we therefore require $\Phi^\top \Phi \in \mathbb{R}^{K \times K}$
to be invertible. This is the case if and only if rk($\Phi$) = K. ♢


Example 9.5 (Maximum Likelihood Polynomial Fit)

[Figure 9.4: (a) Regression dataset. (b) Polynomial of degree 4 determined by maximum likelihood estimation.]

Consider the dataset in Figure 9.4(a). The dataset consists of N = 10
pairs $(x_n, y_n)$, where $x_n \sim \mathcal{U}[-5, 5]$ and $y_n = -\sin(x_n/5) + \cos(x_n) + \epsilon$,
where $\epsilon \sim \mathcal{N}(0, 0.2^2)$.

We fit a polynomial of degree 4 using maximum likelihood estimation,
i.e., parameters $\theta_{\mathrm{ML}}$ are given in (9.19). The maximum likelihood estimate
yields function values $\phi^\top(x_*)\theta_{\mathrm{ML}}$ at any test location $x_*$. The result is
shown in Figure 9.4(b).


Estimating the Noise Variance 


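Thus far, we assumed that the noise variance $\sigma^2$ is known. However, we
can also use the principle of maximum likelihood estimation to obtain the
maximum likelihood estimator $\sigma^2_{\text{ML}}$ for the noise variance. To
do this, we follow the standard procedure: We write down the log-likelihood,
compute its derivative with respect to $\sigma^2 > 0$, set it to 0, and solve.
The log-likelihood is given by

$$
\begin{aligned}
\log p(\mathcal{Y} \mid \mathcal{X}, \boldsymbol{\theta}, \sigma^2)
&= \sum_{n=1}^{N} \log \mathcal{N}\big(y_n \mid \boldsymbol{\phi}^\top(\boldsymbol{x}_n)\boldsymbol{\theta}, \sigma^2\big) && \text{(9.20a)} \\
&= \sum_{n=1}^{N} \Big( -\tfrac{1}{2}\log(2\pi) - \tfrac{1}{2}\log\sigma^2 - \tfrac{1}{2\sigma^2}\big(y_n - \boldsymbol{\phi}^\top(\boldsymbol{x}_n)\boldsymbol{\theta}\big)^2 \Big) && \text{(9.20b)} \\
&= -\frac{N}{2}\log\sigma^2 - \frac{1}{2\sigma^2} \underbrace{\sum_{n=1}^{N} \big(y_n - \boldsymbol{\phi}^\top(\boldsymbol{x}_n)\boldsymbol{\theta}\big)^2}_{=:\, s} + \text{const}. && \text{(9.20c)}
\end{aligned}
$$

The partial derivative of the log-likelihood with respect to $\sigma^2$ is then

$$
\begin{aligned}
\frac{\partial \log p(\mathcal{Y} \mid \mathcal{X}, \boldsymbol{\theta}, \sigma^2)}{\partial \sigma^2}
&= -\frac{N}{2\sigma^2} + \frac{1}{2\sigma^4}\, s = 0 && \text{(9.21a)} \\
\iff \quad \frac{N}{2\sigma^2} &= \frac{s}{2\sigma^4} && \text{(9.21b)}
\end{aligned}
$$

so that we identify

$$
\sigma^2_{\text{ML}} = \frac{s}{N} = \frac{1}{N} \sum_{n=1}^{N} \big(y_n - \boldsymbol{\phi}^\top(\boldsymbol{x}_n)\boldsymbol{\theta}\big)^2. \tag{9.22}
$$

Therefore, the maximum likelihood estimate of the noise variance is the
empirical mean of the squared distances between the noise-free function
values $\boldsymbol{\phi}^\top(\boldsymbol{x}_n)\boldsymbol{\theta}$ and the
corresponding noisy observations $y_n$ at input locations $\boldsymbol{x}_n$.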
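The estimator (9.22) can be illustrated numerically. The sketch below simulates data from a linear model with known noise level and checks that the mean squared residual is close to the true variance. The data-generating values (`theta_true`, `sigma`, the seed) are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
N, sigma = 2000, 0.2                        # assumed sample size and noise std
x = rng.uniform(-5, 5, size=N)
Phi = np.vander(x, 2, increasing=True)      # features [1, x]
theta_true = np.array([0.5, -1.0])          # hypothetical ground truth
y = Phi @ theta_true + sigma * rng.normal(size=N)

# Maximum likelihood parameters (9.19), then the noise variance estimate (9.22).
theta_ml, *_ = np.linalg.lstsq(Phi, y, rcond=None)
residuals = y - Phi @ theta_ml
sigma2_ml = np.mean(residuals**2)
```

With many more data points than parameters, `sigma2_ml` should lie close to `sigma**2`; for small `N` the estimate is noticeably noisier.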


9.2.2 Overfitting in Linear Regression 


We just discussed how to use maximum likelihood estimation to fit lin- 
ear models (e.g., polynomials) to data. We can evaluate the quality of 
the model by computing the error/loss incurred. One way of doing this 
is to compute the negative log-likelihood (9.10b), which we minimized 
to determine the maximum likelihood estimator. Alternatively, given that
the noise parameter $\sigma^2$ is not a free model parameter, we can ignore
the scaling by $1/\sigma^2$, so that we end up with a squared-error-loss
function $\|\boldsymbol{y} - \boldsymbol{\Phi}\boldsymbol{\theta}\|^2$.
Instead of using this squared loss, we often use the root mean square error
(RMSE)

$$
\sqrt{\frac{1}{N} \|\boldsymbol{y} - \boldsymbol{\Phi}\boldsymbol{\theta}\|^2}
= \sqrt{\frac{1}{N} \sum_{n=1}^{N} \big(y_n - \boldsymbol{\phi}^\top(\boldsymbol{x}_n)\boldsymbol{\theta}\big)^2}, \tag{9.23}
$$

which (a) allows us to compare errors of datasets with different sizes
and (b) has the same scale and the same units as the observed function
values $y_n$. For example, if we fit a model that maps post-codes
($\boldsymbol{x}$ is given in latitude, longitude) to house prices ($y$-values
are EUR) then the RMSE is also measured in EUR, whereas the squared error is
given in EUR$^2$. If we choose to include the factor $\sigma^2$ from the
original negative log-likelihood (9.10b), then we end up with a unitless
objective, i.e., in the preceding example, our objective would no longer be
in EUR or EUR$^2$.

For model selection (see Section 8.6), we can use the RMSE (or the 
negative log-likelihood) to determine the best degree of the polynomial by 
finding the polynomial degree M that minimizes the objective. Given that 
the polynomial degree is a natural number, we can perform a brute-force 
search and enumerate all (reasonable) values of M. For a training set of 
size $N$ it is sufficient to test $0 \leq M \leq N - 1$. For $M < N$, the
maximum likelihood estimator is unique. For $M \geq N$, we have more
parameters than data points, and would need to solve an underdetermined
system of linear equations ($\boldsymbol{\Phi}^\top\boldsymbol{\Phi}$ in
(9.19) would also no longer be invertible) so that there are infinitely many
possible maximum likelihood estimators.

Figure 9.5 shows a number of polynomial fits determined by maximum 
likelihood for the dataset from Figure 9.4(a) with N = 10 observations. 
We notice that polynomials of low degree (e.g., constants (M = 0) or 
linear (M = 1)) fit the data poorly and, hence, are poor representations 
of the true underlying function. For degrees M = 3,...,6, the fits look 
plausible and smoothly interpolate the data. When we go to higher-degree 
polynomials, we notice that they fit the data better and better. In the ex- 
treme case of M = N—1 = 9, the function will pass through every single 
data point. However, these high-degree polynomials oscillate wildly and 
are a poor representation of the underlying function that generated the 
data, such that we suffer from overfitting. 

Remember that the goal is to achieve good generalization by making 
accurate predictions for new (unseen) data. We obtain some quantita- 
tive insight into the dependence of the generalization performance on the 
polynomial of degree M by considering a separate test set comprising 200 
data points generated using exactly the same procedure used to generate 
the training set. As test inputs, we chose a linear grid of 200 points in the
interval $[-5, 5]$. For each choice of $M$, we evaluate the RMSE (9.23) for
both the training data and the test data.

Looking now at the test error, which is a qualitative measure of the
generalization properties of the corresponding polynomial, we notice that
initially the test error decreases; see Figure 9.6 (orange). For fourth-order
polynomials, the test error is relatively low and stays relatively constant up
to degree 5. However, from degree 6 onward the test error increases
significantly, and high-order polynomials have very bad generalization
properties. In this particular example, this is also evident from the
corresponding maximum likelihood fits in Figure 9.5. Note that the training
error (blue curve in Figure 9.6) never increases when the degree of the
polynomial increases. In our example, the best generalization (the point of
the smallest test error) is obtained for a polynomial of degree $M = 4$.
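The brute-force search over polynomial degrees can be sketched as follows. This is an illustration under assumptions: the random seed and the resulting dataset are not the ones used for Figures 9.5 and 9.6, so the degree with the smallest test error may differ from the value reported above.

```python
import numpy as np

def rmse(Phi, y, theta):
    """Root mean square error (9.23)."""
    return np.sqrt(np.mean((y - Phi @ theta) ** 2))

def f(x):
    """Noise-free function used to generate the data in Example 9.5."""
    return -np.sin(x / 5) + np.cos(x)

rng = np.random.default_rng(1)              # assumed seed, not the book's data
x_train = rng.uniform(-5, 5, size=10)
y_train = f(x_train) + 0.2 * rng.normal(size=10)
x_test = np.linspace(-5, 5, 200)            # linear grid of 200 test inputs
y_test = f(x_test) + 0.2 * rng.normal(size=200)

train_err, test_err = [], []
for M in range(10):                         # degrees 0 <= M <= N - 1
    Phi_train = np.vander(x_train, M + 1, increasing=True)
    Phi_test = np.vander(x_test, M + 1, increasing=True)
    theta, *_ = np.linalg.lstsq(Phi_train, y_train, rcond=None)
    train_err.append(rmse(Phi_train, y_train, theta))
    test_err.append(rmse(Phi_test, y_test, theta))

best_M = int(np.argmin(test_err))           # degree with the smallest test RMSE
```

Because the models are nested, the training RMSE cannot increase with `M` (up to floating-point error), while the test RMSE typically falls and then rises again.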


9.2.3 Maximum A Posteriori Estimation 


We just saw that maximum likelihood estimation is prone to overfitting. 
We often observe that the magnitude of the parameter values becomes 
relatively large if we run into overfitting (Bishop, 2006). 

To mitigate the effect of huge parameter values, we can place a prior
distribution $p(\boldsymbol{\theta})$ on the parameters. The prior
distribution explicitly encodes what parameter values are plausible (before
having seen any data). For example, a Gaussian prior
$p(\theta) = \mathcal{N}(0, 1)$ on a single parameter $\theta$ encodes that
parameter values are expected to lie in the interval $[-2, 2]$ (two standard
deviations around the mean value). Once a dataset $\mathcal{X}, \mathcal{Y}$
is available, instead of maximizing the likelihood we seek parameters that
maximize the posterior distribution
$p(\boldsymbol{\theta} \mid \mathcal{X}, \mathcal{Y})$. This procedure is
called maximum a posteriori (MAP) estimation.

The posterior over the parameters $\boldsymbol{\theta}$, given the training
data $\mathcal{X}, \mathcal{Y}$, is obtained by applying Bayes' theorem
(Section 6.3) as

$$
p(\boldsymbol{\theta} \mid \mathcal{X}, \mathcal{Y})
= \frac{p(\mathcal{Y} \mid \mathcal{X}, \boldsymbol{\theta})\, p(\boldsymbol{\theta})}{p(\mathcal{Y} \mid \mathcal{X})}. \tag{9.24}
$$

Since the posterior explicitly depends on the parameter prior
$p(\boldsymbol{\theta})$, the prior will have an effect on the parameter
vector we find as the maximizer of the posterior. We will see this more
explicitly in the following. The parameter vector
$\boldsymbol{\theta}_{\text{MAP}}$ that maximizes the posterior (9.24) is the
MAP estimate.

To find the MAP estimate, we follow steps that are similar in flavor to
maximum likelihood estimation. We start with the log-transform and compute
the log-posterior as

$$
\log p(\boldsymbol{\theta} \mid \mathcal{X}, \mathcal{Y})
= \log p(\mathcal{Y} \mid \mathcal{X}, \boldsymbol{\theta}) + \log p(\boldsymbol{\theta}) + \text{const}, \tag{9.25}
$$

where the constant comprises the terms that are independent of
$\boldsymbol{\theta}$. We see that the log-posterior in (9.25) is the sum of
the log-likelihood $\log p(\mathcal{Y} \mid \mathcal{X}, \boldsymbol{\theta})$
and the log-prior $\log p(\boldsymbol{\theta})$ so that the MAP estimate will
be a "compromise" between the prior (our suggestion for plausible parameter
values before observing data) and the data-dependent likelihood.

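To find the MAP estimate $\boldsymbol{\theta}_{\text{MAP}}$, we minimize the
negative log-posterior distribution with respect to $\boldsymbol{\theta}$,
i.e., we solve

$$
\boldsymbol{\theta}_{\text{MAP}} \in \arg\min_{\boldsymbol{\theta}} \big\{ -\log p(\mathcal{Y} \mid \mathcal{X}, \boldsymbol{\theta}) - \log p(\boldsymbol{\theta}) \big\}. \tag{9.26}
$$

The gradient of the negative log-posterior with respect to
$\boldsymbol{\theta}$ is

$$
-\frac{\mathrm{d} \log p(\boldsymbol{\theta} \mid \mathcal{X}, \mathcal{Y})}{\mathrm{d}\boldsymbol{\theta}}
= -\frac{\mathrm{d} \log p(\mathcal{Y} \mid \mathcal{X}, \boldsymbol{\theta})}{\mathrm{d}\boldsymbol{\theta}}
- \frac{\mathrm{d} \log p(\boldsymbol{\theta})}{\mathrm{d}\boldsymbol{\theta}}, \tag{9.27}
$$

where we identify the first term on the right-hand side as the gradient of
the negative log-likelihood from (9.11c).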

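With a (conjugate) Gaussian prior
$p(\boldsymbol{\theta}) = \mathcal{N}(\boldsymbol{0}, b^2 \boldsymbol{I})$ on
the parameters $\boldsymbol{\theta}$, we obtain the negative log-posterior for
the linear regression setting (9.13) as

$$
-\log p(\boldsymbol{\theta} \mid \mathcal{X}, \mathcal{Y})
= \frac{1}{2\sigma^2} (\boldsymbol{y} - \boldsymbol{\Phi}\boldsymbol{\theta})^\top (\boldsymbol{y} - \boldsymbol{\Phi}\boldsymbol{\theta})
+ \frac{1}{2b^2} \boldsymbol{\theta}^\top \boldsymbol{\theta} + \text{const}. \tag{9.28}
$$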


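Here, the first term corresponds to the contribution from the log-likelihood,
and the second term originates from the log-prior. The gradient of the
negative log-posterior with respect to the parameters $\boldsymbol{\theta}$ is
then

$$
-\frac{\mathrm{d} \log p(\boldsymbol{\theta} \mid \mathcal{X}, \mathcal{Y})}{\mathrm{d}\boldsymbol{\theta}}
= \frac{1}{\sigma^2} \big(\boldsymbol{\theta}^\top \boldsymbol{\Phi}^\top \boldsymbol{\Phi} - \boldsymbol{y}^\top \boldsymbol{\Phi}\big)
+ \frac{1}{b^2} \boldsymbol{\theta}^\top. \tag{9.29}
$$

We will find the MAP estimate $\boldsymbol{\theta}_{\text{MAP}}$ by setting
this gradient to $\boldsymbol{0}^\top$ and solving for
$\boldsymbol{\theta}_{\text{MAP}}$. We obtain

$$
\begin{aligned}
& \frac{1}{\sigma^2} \big(\boldsymbol{\theta}^\top \boldsymbol{\Phi}^\top \boldsymbol{\Phi} - \boldsymbol{y}^\top \boldsymbol{\Phi}\big) + \frac{1}{b^2} \boldsymbol{\theta}^\top = \boldsymbol{0}^\top && \text{(9.30a)} \\
\iff\; & \boldsymbol{\theta}^\top \Big( \frac{1}{\sigma^2} \boldsymbol{\Phi}^\top \boldsymbol{\Phi} + \frac{1}{b^2} \boldsymbol{I} \Big) - \frac{1}{\sigma^2} \boldsymbol{y}^\top \boldsymbol{\Phi} = \boldsymbol{0}^\top && \text{(9.30b)} \\
\iff\; & \boldsymbol{\theta}^\top \Big( \boldsymbol{\Phi}^\top \boldsymbol{\Phi} + \frac{\sigma^2}{b^2} \boldsymbol{I} \Big) = \boldsymbol{y}^\top \boldsymbol{\Phi} && \text{(9.30c)} \\
\iff\; & \boldsymbol{\theta}^\top = \boldsymbol{y}^\top \boldsymbol{\Phi} \Big( \boldsymbol{\Phi}^\top \boldsymbol{\Phi} + \frac{\sigma^2}{b^2} \boldsymbol{I} \Big)^{-1} && \text{(9.30d)}
\end{aligned}
$$

so that the MAP estimate is (by transposing both sides of the last equality)

$$
\boldsymbol{\theta}_{\text{MAP}} = \Big( \boldsymbol{\Phi}^\top \boldsymbol{\Phi} + \frac{\sigma^2}{b^2} \boldsymbol{I} \Big)^{-1} \boldsymbol{\Phi}^\top \boldsymbol{y}. \tag{9.31}
$$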
Comparing the MAP estimate in (9.31) with the maximum likelihood estimate in
(9.19), we see that the only difference between both solutions is the
additional term $\frac{\sigma^2}{b^2}\boldsymbol{I}$ in the inverse matrix.
This term ensures that
$\boldsymbol{\Phi}^\top\boldsymbol{\Phi} + \frac{\sigma^2}{b^2}\boldsymbol{I}$
is symmetric and strictly positive definite (i.e., its inverse exists and the
MAP estimate is the unique solution of a system of linear equations).
Moreover, it reflects the impact of the regularizer.
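A minimal sketch of the MAP estimate (9.31), compared with the maximum likelihood estimate on the same data. The function name `fit_map`, the seed, and the choice of a degree-6 polynomial with $b^2 = 1$ are assumptions made for illustration.

```python
import numpy as np

def fit_map(Phi, y, sigma2, b2):
    """MAP estimate (9.31) for the prior N(0, b2 * I) and noise variance sigma2."""
    K = Phi.shape[1]
    A = Phi.T @ Phi + (sigma2 / b2) * np.eye(K)   # symmetric, positive definite
    return np.linalg.solve(A, Phi.T @ y)

rng = np.random.default_rng(2)                    # assumed seed and dataset
x = rng.uniform(-5, 5, size=10)
y = -np.sin(x / 5) + np.cos(x) + 0.2 * rng.normal(size=10)
Phi = np.vander(x, 7, increasing=True)            # degree-6 polynomial features

sigma2, b2 = 0.2**2, 1.0
theta_ml, *_ = np.linalg.lstsq(Phi, y, rcond=None)
theta_map = fit_map(Phi, y, sigma2, b2)

# Stationarity condition (9.30c): (Phi^T Phi + sigma2/b2 I) theta = Phi^T y.
lhs = (Phi.T @ Phi + (sigma2 / b2) * np.eye(7)) @ theta_map
rhs = Phi.T @ y
```

The added diagonal term shrinks the parameter vector, so `theta_map` has a smaller Euclidean norm than `theta_ml`.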


Example 9.6 (MAP Estimation for Polynomial Regression) 

In the polynomial regression example from Section 9.2.1, we place a Gaussian
prior $p(\boldsymbol{\theta}) = \mathcal{N}(\boldsymbol{0}, \boldsymbol{I})$
on the parameters $\boldsymbol{\theta}$ and determine the MAP estimates
according to (9.31). In Figure 9.7, we show both the maximum
likelihood and the MAP estimates for polynomials of degree 6 (left) and 
degree 8 (right). The prior (regularizer) does not play a significant role 
for the low-degree polynomial, but keeps the function relatively smooth 
for higher-degree polynomials. Although the MAP estimate can push the 
boundaries of overfitting, it is not a general solution to this problem, so 
we need a more principled approach to tackle overfitting. 








[Figure 9.7: Training data with MLE and MAP fits. (a) Polynomials of degree 6.
(b) Polynomials of degree 8.]


9.2.4 MAP Estimation as Regularization 


Instead of placing a prior distribution on the parameters
$\boldsymbol{\theta}$, it is also possible to mitigate the effect of
overfitting by penalizing the amplitude of the parameter by means of
regularization. In regularized least squares, we consider the loss function

$$
\|\boldsymbol{y} - \boldsymbol{\Phi}\boldsymbol{\theta}\|^2 + \lambda \|\boldsymbol{\theta}\|_2^2, \tag{9.32}
$$

which we minimize with respect to $\boldsymbol{\theta}$ (see Section 8.2.3).
Here, the first term is a data-fit term (also called misfit term), which is
proportional to the negative log-likelihood; see (9.10b). The second term is
called the regularizer, and the regularization parameter $\lambda \geq 0$
controls the "strictness" of the regularization.


Remark. Instead of the Euclidean norm $\|\cdot\|_2$, we can choose any
$p$-norm $\|\cdot\|_p$ in (9.32). In practice, smaller values for $p$ lead to
sparser solutions. Here, "sparse" means that many parameter values
$\theta_d = 0$, which is also useful for variable selection. For $p = 1$, the
regularizer is called LASSO (least absolute shrinkage and selection operator)
and was proposed by Tibshirani (1996). ♢


The regularizer $\lambda \|\boldsymbol{\theta}\|_2^2$ in (9.32) can be
interpreted as a negative log-Gaussian prior, which we use in MAP estimation;
see (9.26). More specifically, with a Gaussian prior
$p(\boldsymbol{\theta}) = \mathcal{N}(\boldsymbol{0}, b^2 \boldsymbol{I})$, we
obtain the negative log-Gaussian prior

$$
-\log p(\boldsymbol{\theta}) = \frac{1}{2b^2} \|\boldsymbol{\theta}\|_2^2 + \text{const} \tag{9.33}
$$

so that for $\lambda = \frac{1}{2b^2}$ the regularization term and the
negative log-Gaussian prior are identical.

Given that the regularized least-squares loss function in (9.32) consists of
terms that are closely related to the negative log-likelihood plus a negative
log-prior, it is not surprising that, when we minimize this loss, we obtain a
solution that closely resembles the MAP estimate in (9.31). More
specifically, minimizing the regularized least-squares loss function yields

$$
\boldsymbol{\theta}_{\text{RLS}} = (\boldsymbol{\Phi}^\top \boldsymbol{\Phi} + \lambda \boldsymbol{I})^{-1} \boldsymbol{\Phi}^\top \boldsymbol{y}, \tag{9.34}
$$

which is identical to the MAP estimate in (9.31) for
$\lambda = \frac{\sigma^2}{b^2}$, where $\sigma^2$ is the noise variance and
$b^2$ the variance of the (isotropic) Gaussian prior
$p(\boldsymbol{\theta}) = \mathcal{N}(\boldsymbol{0}, b^2 \boldsymbol{I})$.
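One way to see that (9.34) minimizes the loss (9.32) is to rewrite the regularized problem as an ordinary least-squares problem with $K$ extra "pseudo-observations". The sketch below does this and compares the result with the closed form; the stacking trick, the function name `fit_rls`, and the random data are assumptions made for illustration, not a construction taken from the text.

```python
import numpy as np

def fit_rls(Phi, y, lam):
    """Minimize ||y - Phi theta||^2 + lam * ||theta||_2^2 via an augmented system."""
    K = Phi.shape[1]
    # Stacking sqrt(lam) * I under Phi adds lam * ||theta||^2 to the squared error.
    Phi_aug = np.vstack([Phi, np.sqrt(lam) * np.eye(K)])
    y_aug = np.concatenate([y, np.zeros(K)])
    theta, *_ = np.linalg.lstsq(Phi_aug, y_aug, rcond=None)
    return theta

rng = np.random.default_rng(3)
Phi = rng.normal(size=(20, 4))     # hypothetical feature matrix
y = rng.normal(size=20)            # hypothetical targets

sigma2, b2 = 0.04, 1.0
lam = sigma2 / b2                  # choice that matches the MAP estimate (9.31)
theta_rls = fit_rls(Phi, y, lam)
theta_closed = np.linalg.solve(Phi.T @ Phi + lam * np.eye(4), Phi.T @ y)  # (9.34)
```

Both routes return the same vector, which is also the MAP estimate for noise variance `sigma2` and prior variance `b2`.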

So far, we have covered parameter estimation using maximum likelihood and MAP
estimation where we found point estimates $\boldsymbol{\theta}^*$ that
optimize an objective function (likelihood or posterior). We saw that both
maximum likelihood and MAP estimation can lead to overfitting. In the 
next section, we will discuss Bayesian linear regression, where we use 
Bayesian inference (Section 8.4) to find a posterior distribution over the 
unknown parameters, which we subsequently use to make predictions. 
More specifically, for predictions we will average over all plausible sets of 
parameters instead of focusing on a point estimate. 


9.3 Bayesian Linear Regression 


Previously, we looked at linear regression models where we estimated the
model parameters $\boldsymbol{\theta}$, e.g., by means of maximum likelihood
or MAP estimation. We discovered that MLE can lead to severe overfitting, in
particular, in the small-data regime. MAP addresses this issue by placing a prior
on the parameters that plays the role of a regularizer. 

Bayesian linear regression pushes the idea of the parameter prior a step 
further and does not even attempt to compute a point estimate of the 
parameters, but instead the full posterior distribution over the parameters 
is taken into account when making predictions. This means we do not fit 
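any parameters, but we compute a mean over all plausible parameter
settings (according to the posterior).


9.3.1 Model


In Bayesian linear regression, we consider the model

$$
\begin{aligned}
\text{prior} \quad & p(\boldsymbol{\theta}) = \mathcal{N}(\boldsymbol{m}_0, \boldsymbol{S}_0), \\
\text{likelihood} \quad & p(y \mid \boldsymbol{x}, \boldsymbol{\theta}) = \mathcal{N}\big(y \mid \boldsymbol{\phi}^\top(\boldsymbol{x})\boldsymbol{\theta}, \sigma^2\big),
\end{aligned} \tag{9.35}
$$

where we now explicitly place a Gaussian prior
$p(\boldsymbol{\theta}) = \mathcal{N}(\boldsymbol{m}_0, \boldsymbol{S}_0)$ on
$\boldsymbol{\theta}$, which turns the parameter vector into a random
variable. This allows us to write down the corresponding graphical model in
Figure 9.8, where we made the parameters of the Gaussian prior on
$\boldsymbol{\theta}$ explicit. The full probabilistic model, i.e., the joint
distribution of observed and unobserved random variables, $y$ and
$\boldsymbol{\theta}$, respectively, is

$$
p(y, \boldsymbol{\theta} \mid \boldsymbol{x}) = p(y \mid \boldsymbol{x}, \boldsymbol{\theta})\, p(\boldsymbol{\theta}). \tag{9.36}
$$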


9.3.2 Prior Predictions 


In practice, we are usually not so much interested in the parameter values
$\boldsymbol{\theta}$ themselves. Instead, our focus often lies in the
predictions we make with those parameter values. In a Bayesian setting, we
take the parameter distribution and average over all plausible parameter
settings when we make predictions. More specifically, to make predictions at
an input $\boldsymbol{x}_*$, we integrate out $\boldsymbol{\theta}$ and obtain

$$
p(y_* \mid \boldsymbol{x}_*) = \int p(y_* \mid \boldsymbol{x}_*, \boldsymbol{\theta})\, p(\boldsymbol{\theta})\, \mathrm{d}\boldsymbol{\theta}
= \mathbb{E}_{\boldsymbol{\theta}}\big[ p(y_* \mid \boldsymbol{x}_*, \boldsymbol{\theta}) \big], \tag{9.37}
$$

which we can interpret as the average prediction of
$y_* \mid \boldsymbol{x}_*, \boldsymbol{\theta}$ for all plausible parameters
$\boldsymbol{\theta}$ according to the prior distribution
$p(\boldsymbol{\theta})$. Note that predictions using the prior distribution
only require us to specify the input $\boldsymbol{x}_*$, but no training data.

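In our model (9.35), we chose a conjugate (Gaussian) prior on
$\boldsymbol{\theta}$ so that the predictive distribution is Gaussian as well
(and can be computed in closed form): With the prior distribution
$p(\boldsymbol{\theta}) = \mathcal{N}(\boldsymbol{m}_0, \boldsymbol{S}_0)$, we
obtain the predictive distribution as

$$
p(y_* \mid \boldsymbol{x}_*) = \mathcal{N}\big( \boldsymbol{\phi}^\top(\boldsymbol{x}_*)\boldsymbol{m}_0,\; \boldsymbol{\phi}^\top(\boldsymbol{x}_*)\boldsymbol{S}_0\boldsymbol{\phi}(\boldsymbol{x}_*) + \sigma^2 \big), \tag{9.38}
$$

where we exploited that (i) the prediction is Gaussian due to conjugacy (see
Section 6.6) and the marginalization property of Gaussians (see Section 6.5),
(ii) the Gaussian noise is independent so that

$$
\mathbb{V}[y_*] = \mathbb{V}_{\boldsymbol{\theta}}\big[\boldsymbol{\phi}^\top(\boldsymbol{x}_*)\boldsymbol{\theta}\big] + \mathbb{V}_{\epsilon}[\epsilon], \tag{9.39}
$$

and (iii) $y_*$ is a linear transformation of $\boldsymbol{\theta}$ so that we
can apply the rules for computing the mean and covariance of the prediction
analytically by using (6.50) and (6.51), respectively. In (9.38), the term
$\boldsymbol{\phi}^\top(\boldsymbol{x}_*)\boldsymbol{S}_0\boldsymbol{\phi}(\boldsymbol{x}_*)$
in the predictive variance explicitly accounts for the uncertainty associated
with the parameters $\boldsymbol{\theta}$, whereas $\sigma^2$ is the
uncertainty contribution due to the measurement noise.

If we are interested in predicting noise-free function values
$f(\boldsymbol{x}_*) = \boldsymbol{\phi}^\top(\boldsymbol{x}_*)\boldsymbol{\theta}$
instead of the noise-corrupted targets $y_*$ we obtain

$$
p(f(\boldsymbol{x}_*)) = \mathcal{N}\big( \boldsymbol{\phi}^\top(\boldsymbol{x}_*)\boldsymbol{m}_0,\; \boldsymbol{\phi}^\top(\boldsymbol{x}_*)\boldsymbol{S}_0\boldsymbol{\phi}(\boldsymbol{x}_*) \big), \tag{9.40}
$$

which only differs from (9.38) in the omission of the noise variance
$\sigma^2$ in the predictive variance.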
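The prior predictive moments in (9.38) and (9.40) are simple to evaluate. The following sketch assumes polynomial features of degree 5 and illustrative values for the prior and the noise variance; the function names are hypothetical.

```python
import numpy as np

def phi(x, degree):
    """Polynomial feature vector [1, x, ..., x**degree] for a scalar input."""
    return np.array([x**k for k in range(degree + 1)], dtype=float)

def prior_predictive(x_star, m0, S0, sigma2, degree):
    """Mean, noise-free variance (9.40) and noisy variance (9.38) at x_star."""
    p = phi(x_star, degree)
    mean = p @ m0
    var_f = p @ S0 @ p            # uncertainty due to the parameters only
    var_y = var_f + sigma2        # plus measurement noise
    return mean, var_f, var_y

degree = 5
m0 = np.zeros(degree + 1)         # assumed prior mean
S0 = 0.25 * np.eye(degree + 1)    # assumed prior covariance
mean, var_f, var_y = prior_predictive(2.0, m0, S0, sigma2=0.04, degree=degree)
# Here var_f = 0.25 * (1 + 4 + 16 + 64 + 256 + 1024) = 341.25.
```

The variance grows quickly with $|x_*|$ because the high-order monomials amplify the parameter uncertainty.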


Remark (Distribution over Functions). Since we can represent the distribution
$p(\boldsymbol{\theta})$ using a set of samples $\boldsymbol{\theta}_i$ and
every sample $\boldsymbol{\theta}_i$ gives rise to a function
$f_i(\cdot) = \boldsymbol{\theta}_i^\top \boldsymbol{\phi}(\cdot)$, it follows
that the parameter distribution $p(\boldsymbol{\theta})$ induces a
distribution $p(f(\cdot))$ over functions. Here we use the notation $(\cdot)$
to explicitly denote a functional relationship. ♢


Example 9.7 (Prior over Functions) 


























[Figure 9.9: (a) Prior distribution over functions. (b) Samples from the prior
distribution over functions.]


Let us consider a Bayesian linear regression problem with polynomials of
degree 5. We choose a parameter prior
$p(\boldsymbol{\theta}) = \mathcal{N}(\boldsymbol{0}, \tfrac{1}{4}\boldsymbol{I})$.
Figure 9.9 visualizes the prior distribution over functions (shaded area: dark
gray: 67% confidence bound; light gray: 95% confidence bound) induced by this
parameter prior, including some function samples from this prior.

A function sample is obtained by first sampling a parameter vector
$\boldsymbol{\theta}_i \sim p(\boldsymbol{\theta})$ and then computing
$f_i(\cdot) = \boldsymbol{\theta}_i^\top \boldsymbol{\phi}(\cdot)$. We used
200 input locations $x_* \in [-5, 5]$ to which we apply the feature function
$\boldsymbol{\phi}(\cdot)$. The uncertainty (represented by the shaded area)
in Figure 9.9 is solely due to the parameter uncertainty because we
considered the noise-free predictive distribution (9.40).
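The sampling procedure described in this example can be sketched in a few lines. The snippet assumes an isotropic prior with variance 0.25 (an illustrative value), degree-5 polynomial features, and an arbitrary seed; it also compares the empirical spread of the sampled functions with the closed-form standard deviation from (9.40).

```python
import numpy as np

rng = np.random.default_rng(4)
degree, n_samples = 5, 5000
prior_var = 0.25                                      # assumed prior variance
x_grid = np.linspace(-5, 5, 200)                      # 200 input locations
Phi = np.vander(x_grid, degree + 1, increasing=True)  # shape (200, 6)

# Sample theta_i ~ N(0, prior_var * I); row i of F is f_i evaluated on the grid.
thetas = np.sqrt(prior_var) * rng.normal(size=(n_samples, degree + 1))
F = thetas @ Phi.T

# Marginal standard deviation of f(x) under the prior, from (9.40).
std_exact = np.sqrt(prior_var * np.sum(Phi**2, axis=1))
std_empirical = F.std(axis=0)
```

Plotting a handful of rows of `F` against `x_grid` would give curves of the kind shown in panel (b), and `std_exact` determines the width of the shaded bands in panel (a).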


So far, we looked at computing predictions using the parameter prior
$p(\boldsymbol{\theta})$. However, when we have a parameter posterior (given
some training data $\mathcal{X}, \mathcal{Y}$), the same principles for
prediction and inference hold as in (9.37); we just need to replace the prior
$p(\boldsymbol{\theta})$ with the posterior
$p(\boldsymbol{\theta} \mid \mathcal{X}, \mathcal{Y})$. In the following, we
will derive the posterior distribution in detail before using it to make
predictions.


9.3.3 Posterior Distribution 


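Given a training set of inputs $\boldsymbol{x}_n \in \mathbb{R}^D$ and
corresponding observations $y_n \in \mathbb{R}$, $n = 1, \dots, N$, we compute
the posterior over the parameters using Bayes' theorem as

$$
p(\boldsymbol{\theta} \mid \mathcal{X}, \mathcal{Y})
= \frac{p(\mathcal{Y} \mid \mathcal{X}, \boldsymbol{\theta})\, p(\boldsymbol{\theta})}{p(\mathcal{Y} \mid \mathcal{X})}, \tag{9.41}
$$

where $\mathcal{X}$ is the set of training inputs and $\mathcal{Y}$ the
collection of corresponding training targets. Furthermore,
$p(\mathcal{Y} \mid \mathcal{X}, \boldsymbol{\theta})$ is the likelihood,
$p(\boldsymbol{\theta})$ the parameter prior, and

$$
p(\mathcal{Y} \mid \mathcal{X}) = \int p(\mathcal{Y} \mid \mathcal{X}, \boldsymbol{\theta})\, p(\boldsymbol{\theta})\, \mathrm{d}\boldsymbol{\theta}
= \mathbb{E}_{\boldsymbol{\theta}}\big[ p(\mathcal{Y} \mid \mathcal{X}, \boldsymbol{\theta}) \big] \tag{9.42}
$$

the marginal likelihood/evidence, which is independent of the parameters
$\boldsymbol{\theta}$ and ensures that the posterior is normalized, i.e., it
integrates to 1. We can think of the marginal likelihood as the likelihood
averaged over all possible parameter settings (with respect to the prior
distribution $p(\boldsymbol{\theta})$).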


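Theorem 9.1 (Parameter Posterior). In our model (9.35), the parameter
posterior (9.41) can be computed in closed form as

$$
\begin{aligned}
p(\boldsymbol{\theta} \mid \mathcal{X}, \mathcal{Y}) &= \mathcal{N}(\boldsymbol{\theta} \mid \boldsymbol{m}_N, \boldsymbol{S}_N), && \text{(9.43a)} \\
\boldsymbol{S}_N &= (\boldsymbol{S}_0^{-1} + \sigma^{-2} \boldsymbol{\Phi}^\top \boldsymbol{\Phi})^{-1}, && \text{(9.43b)} \\
\boldsymbol{m}_N &= \boldsymbol{S}_N (\boldsymbol{S}_0^{-1} \boldsymbol{m}_0 + \sigma^{-2} \boldsymbol{\Phi}^\top \boldsymbol{y}), && \text{(9.43c)}
\end{aligned}
$$

where the subscript $N$ indicates the size of the training set.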
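Before turning to the proof, the closed form of Theorem 9.1 can be tried out numerically. This is a minimal sketch under assumptions: the function name `posterior`, the quadratic features, and the simulated data are illustrative. With $\boldsymbol{m}_0 = \boldsymbol{0}$ and $\boldsymbol{S}_0 = b^2\boldsymbol{I}$, the posterior mean should agree with the MAP estimate (9.31).

```python
import numpy as np

def posterior(Phi, y, m0, S0, sigma2):
    """Posterior mean m_N and covariance S_N from (9.43b) and (9.43c)."""
    S0_inv = np.linalg.inv(S0)
    SN = np.linalg.inv(S0_inv + Phi.T @ Phi / sigma2)
    mN = SN @ (S0_inv @ m0 + Phi.T @ y / sigma2)
    return mN, SN

rng = np.random.default_rng(5)
x = rng.uniform(-1, 1, size=30)
Phi = np.vander(x, 3, increasing=True)          # features [1, x, x^2]
theta_true = np.array([0.3, -0.7, 1.1])         # hypothetical ground truth
sigma2 = 0.01
y = Phi @ theta_true + np.sqrt(sigma2) * rng.normal(size=30)

mN, SN = posterior(Phi, y, m0=np.zeros(3), S0=np.eye(3), sigma2=sigma2)

# MAP estimate (9.31) with b^2 = 1 for comparison.
theta_map = np.linalg.solve(Phi.T @ Phi + sigma2 * np.eye(3), Phi.T @ y)
```

Explicit inverses are used here to mirror the formulas; for larger problems a Cholesky-based solve would usually be preferred.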


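Proof. Bayes' theorem tells us that the posterior
$p(\boldsymbol{\theta} \mid \mathcal{X}, \mathcal{Y})$ is proportional to the
product of the likelihood
$p(\mathcal{Y} \mid \mathcal{X}, \boldsymbol{\theta})$ and the prior
$p(\boldsymbol{\theta})$:

$$
\begin{aligned}
\text{Posterior} \quad & p(\boldsymbol{\theta} \mid \mathcal{X}, \mathcal{Y}) = \frac{p(\mathcal{Y} \mid \mathcal{X}, \boldsymbol{\theta})\, p(\boldsymbol{\theta})}{p(\mathcal{Y} \mid \mathcal{X})} && \text{(9.44a)} \\
\text{Likelihood} \quad & p(\mathcal{Y} \mid \mathcal{X}, \boldsymbol{\theta}) = \mathcal{N}(\boldsymbol{y} \mid \boldsymbol{\Phi}\boldsymbol{\theta}, \sigma^2 \boldsymbol{I}) && \text{(9.44b)} \\
\text{Prior} \quad & p(\boldsymbol{\theta}) = \mathcal{N}(\boldsymbol{\theta} \mid \boldsymbol{m}_0, \boldsymbol{S}_0). && \text{(9.44c)}
\end{aligned}
$$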


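Instead of looking at the product of the prior and the likelihood, we can
transform the problem into log-space and solve for the mean and covariance of
the posterior by completing the squares.

The sum of the log-prior and the log-likelihood is

$$
\begin{aligned}
& \log \mathcal{N}(\boldsymbol{y} \mid \boldsymbol{\Phi}\boldsymbol{\theta}, \sigma^2 \boldsymbol{I}) + \log \mathcal{N}(\boldsymbol{\theta} \mid \boldsymbol{m}_0, \boldsymbol{S}_0) && \text{(9.45a)} \\
&= -\tfrac{1}{2} \big( \sigma^{-2} (\boldsymbol{y} - \boldsymbol{\Phi}\boldsymbol{\theta})^\top (\boldsymbol{y} - \boldsymbol{\Phi}\boldsymbol{\theta}) + (\boldsymbol{\theta} - \boldsymbol{m}_0)^\top \boldsymbol{S}_0^{-1} (\boldsymbol{\theta} - \boldsymbol{m}_0) \big) + \text{const}, && \text{(9.45b)}
\end{aligned}
$$

where the constant contains terms independent of $\boldsymbol{\theta}$. We
will ignore the constant in the following. We now factorize (9.45b), which
yields

$$
\begin{aligned}
& -\tfrac{1}{2} \big( \sigma^{-2} \boldsymbol{y}^\top \boldsymbol{y} - 2\sigma^{-2} \boldsymbol{y}^\top \boldsymbol{\Phi}\boldsymbol{\theta} + \boldsymbol{\theta}^\top \sigma^{-2} \boldsymbol{\Phi}^\top \boldsymbol{\Phi} \boldsymbol{\theta} + \boldsymbol{\theta}^\top \boldsymbol{S}_0^{-1} \boldsymbol{\theta} - 2 \boldsymbol{m}_0^\top \boldsymbol{S}_0^{-1} \boldsymbol{\theta} + \boldsymbol{m}_0^\top \boldsymbol{S}_0^{-1} \boldsymbol{m}_0 \big) && \text{(9.46a)} \\
&= -\tfrac{1}{2} \big( \boldsymbol{\theta}^\top (\sigma^{-2} \boldsymbol{\Phi}^\top \boldsymbol{\Phi} + \boldsymbol{S}_0^{-1}) \boldsymbol{\theta} - 2 (\sigma^{-2} \boldsymbol{\Phi}^\top \boldsymbol{y} + \boldsymbol{S}_0^{-1} \boldsymbol{m}_0)^\top \boldsymbol{\theta} \big) + \text{const}, && \text{(9.46b)}
\end{aligned}
$$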

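where the constant contains the terms $\sigma^{-2}\boldsymbol{y}^\top\boldsymbol{y}$
and $\boldsymbol{m}_0^\top\boldsymbol{S}_0^{-1}\boldsymbol{m}_0$ in (9.46a),
which are independent of $\boldsymbol{\theta}$ (printed in black in the
original typesetting). The terms containing
$\boldsymbol{y}^\top\boldsymbol{\Phi}\boldsymbol{\theta}$ and
$\boldsymbol{m}_0^\top\boldsymbol{S}_0^{-1}\boldsymbol{\theta}$ are linear in
$\boldsymbol{\theta}$ (orange), and the remaining two terms are quadratic in
$\boldsymbol{\theta}$ (blue). Inspecting (9.46b), we find that this equation
is quadratic in $\boldsymbol{\theta}$. The fact that the unnormalized
log-posterior distribution is a (negative) quadratic form implies that the
posterior is Gaussian, i.e.,

$$
\begin{aligned}
p(\boldsymbol{\theta} \mid \mathcal{X}, \mathcal{Y}) &= \exp\big(\log p(\boldsymbol{\theta} \mid \mathcal{X}, \mathcal{Y})\big) \propto \exp\big(\log p(\mathcal{Y} \mid \mathcal{X}, \boldsymbol{\theta}) + \log p(\boldsymbol{\theta})\big) && \text{(9.47a)} \\
&\propto \exp\Big( -\tfrac{1}{2} \big( \boldsymbol{\theta}^\top (\sigma^{-2} \boldsymbol{\Phi}^\top \boldsymbol{\Phi} + \boldsymbol{S}_0^{-1}) \boldsymbol{\theta} - 2 (\sigma^{-2} \boldsymbol{\Phi}^\top \boldsymbol{y} + \boldsymbol{S}_0^{-1} \boldsymbol{m}_0)^\top \boldsymbol{\theta} \big) \Big), && \text{(9.47b)}
\end{aligned}
$$

where we used (9.46b) in the last expression.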
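The remaining task is to bring this (unnormalized) Gaussian into the form
that is proportional to
$\mathcal{N}(\boldsymbol{\theta} \mid \boldsymbol{m}_N, \boldsymbol{S}_N)$,
i.e., we need to identify the mean $\boldsymbol{m}_N$ and the covariance
matrix $\boldsymbol{S}_N$. To do this, we use the concept of completing the
squares. The desired log-posterior is

$$
\begin{aligned}
\log \mathcal{N}(\boldsymbol{\theta} \mid \boldsymbol{m}_N, \boldsymbol{S}_N)
&= -\tfrac{1}{2} (\boldsymbol{\theta} - \boldsymbol{m}_N)^\top \boldsymbol{S}_N^{-1} (\boldsymbol{\theta} - \boldsymbol{m}_N) + \text{const} && \text{(9.48a)} \\
&= -\tfrac{1}{2} \big( \boldsymbol{\theta}^\top \boldsymbol{S}_N^{-1} \boldsymbol{\theta} - 2 \boldsymbol{m}_N^\top \boldsymbol{S}_N^{-1} \boldsymbol{\theta} + \boldsymbol{m}_N^\top \boldsymbol{S}_N^{-1} \boldsymbol{m}_N \big). && \text{(9.48b)}
\end{aligned}
$$

Here, we factorized the quadratic form
$(\boldsymbol{\theta} - \boldsymbol{m}_N)^\top \boldsymbol{S}_N^{-1} (\boldsymbol{\theta} - \boldsymbol{m}_N)$
into a term that is quadratic in $\boldsymbol{\theta}$ alone, a term that is
linear in $\boldsymbol{\theta}$, and a constant term. This allows us now to
find $\boldsymbol{S}_N$ and $\boldsymbol{m}_N$ by matching the quadratic and
linear terms in (9.46b) and (9.48b), which yields

$$
\begin{aligned}
\boldsymbol{S}_N^{-1} &= \boldsymbol{\Phi}^\top \sigma^{-2} \boldsymbol{I} \boldsymbol{\Phi} + \boldsymbol{S}_0^{-1} && \text{(9.49a)} \\
\iff \boldsymbol{S}_N &= (\sigma^{-2} \boldsymbol{\Phi}^\top \boldsymbol{\Phi} + \boldsymbol{S}_0^{-1})^{-1} && \text{(9.49b)}
\end{aligned}
$$

and

$$
\begin{aligned}
\boldsymbol{m}_N^\top \boldsymbol{S}_N^{-1} &= (\sigma^{-2} \boldsymbol{\Phi}^\top \boldsymbol{y} + \boldsymbol{S}_0^{-1} \boldsymbol{m}_0)^\top && \text{(9.50a)} \\
\iff \boldsymbol{m}_N &= \boldsymbol{S}_N (\sigma^{-2} \boldsymbol{\Phi}^\top \boldsymbol{y} + \boldsymbol{S}_0^{-1} \boldsymbol{m}_0). && \text{(9.50b)}
\end{aligned}
$$
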
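Remark (General Approach to Completing the Squares). If we are given an
equation

$$
\boldsymbol{x}^\top \boldsymbol{A} \boldsymbol{x} - 2 \boldsymbol{a}^\top \boldsymbol{x} + \text{const}_1, \tag{9.51}
$$

where $\boldsymbol{A}$ is symmetric and positive definite, which we wish to
bring into the form

$$
(\boldsymbol{x} - \boldsymbol{\mu})^\top \boldsymbol{\Sigma} (\boldsymbol{x} - \boldsymbol{\mu}) + \text{const}_2, \tag{9.52}
$$

we can do this by setting

$$
\boldsymbol{\Sigma} := \boldsymbol{A}, \tag{9.53}
$$

$$
\boldsymbol{\mu} := \boldsymbol{\Sigma}^{-1} \boldsymbol{a} \tag{9.54}
$$

and $\text{const}_2 = \text{const}_1 - \boldsymbol{\mu}^\top \boldsymbol{\Sigma} \boldsymbol{\mu}$. ♢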

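We can see that the terms inside the exponential in (9.47b) are of the form
(9.51) with

$$
\boldsymbol{A} := \sigma^{-2} \boldsymbol{\Phi}^\top \boldsymbol{\Phi} + \boldsymbol{S}_0^{-1}, \tag{9.55}
$$

$$
\boldsymbol{a} := \sigma^{-2} \boldsymbol{\Phi}^\top \boldsymbol{y} + \boldsymbol{S}_0^{-1} \boldsymbol{m}_0. \tag{9.56}
$$

Since $\boldsymbol{A}, \boldsymbol{a}$ can be difficult to identify in
equations like (9.46a), it is often helpful to bring these equations into the
form (9.51) that decouples the quadratic term, the linear terms, and the
constants, which simplifies finding the desired solution.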
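The completing-the-squares recipe (9.51) to (9.54) is easy to verify numerically. The sketch below draws an arbitrary symmetric positive definite matrix and checks that both forms agree at a random point; all concrete values are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(6)
B = rng.normal(size=(3, 3))
A = B @ B.T + 3 * np.eye(3)       # symmetric positive definite by construction
a = rng.normal(size=3)
const1 = 0.7

Sigma = A                          # (9.53)
mu = np.linalg.solve(Sigma, a)     # (9.54): mu = Sigma^{-1} a
const2 = const1 - mu @ Sigma @ mu

def original(x):
    """Quadratic expression in the form (9.51)."""
    return x @ A @ x - 2 * a @ x + const1

def completed(x):
    """The same expression after completing the square, form (9.52)."""
    return (x - mu) @ Sigma @ (x - mu) + const2

x_probe = rng.normal(size=3)       # arbitrary evaluation point
gap = abs(original(x_probe) - completed(x_probe))
```

The symmetry of `A` is what makes the cross terms combine into `-2 * a @ x`; for a non-symmetric matrix the identification of `mu` would not hold in this form.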


9.3.4 Posterior Predictions 


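In (9.37), we computed the predictive distribution of $y_*$ at a test input
$\boldsymbol{x}_*$ using the parameter prior $p(\boldsymbol{\theta})$. In
principle, predicting with the parameter posterior
$p(\boldsymbol{\theta} \mid \mathcal{X}, \mathcal{Y})$ is not fundamentally
different given that in our conjugate model the prior and posterior are both
Gaussian (with different parameters). Therefore, by following the same
reasoning as in Section 9.3.2, we obtain the (posterior) predictive
distribution

$$
\begin{aligned}
p(y_* \mid \mathcal{X}, \mathcal{Y}, \boldsymbol{x}_*)
&= \int p(y_* \mid \boldsymbol{x}_*, \boldsymbol{\theta})\, p(\boldsymbol{\theta} \mid \mathcal{X}, \mathcal{Y})\, \mathrm{d}\boldsymbol{\theta} && \text{(9.57a)} \\
&= \int \mathcal{N}\big(y_* \mid \boldsymbol{\phi}^\top(\boldsymbol{x}_*)\boldsymbol{\theta}, \sigma^2\big)\, \mathcal{N}(\boldsymbol{\theta} \mid \boldsymbol{m}_N, \boldsymbol{S}_N)\, \mathrm{d}\boldsymbol{\theta} && \text{(9.57b)} \\
&= \mathcal{N}\big(y_* \mid \boldsymbol{\phi}^\top(\boldsymbol{x}_*)\boldsymbol{m}_N,\; \boldsymbol{\phi}^\top(\boldsymbol{x}_*)\boldsymbol{S}_N\boldsymbol{\phi}(\boldsymbol{x}_*) + \sigma^2\big). && \text{(9.57c)}
\end{aligned}
$$

The term
$\boldsymbol{\phi}^\top(\boldsymbol{x}_*)\boldsymbol{S}_N\boldsymbol{\phi}(\boldsymbol{x}_*)$
reflects the posterior uncertainty associated with the parameters
$\boldsymbol{\theta}$. Note that $\boldsymbol{S}_N$ depends on the training
inputs through $\boldsymbol{\Phi}$; see (9.43b). The predictive mean
$\boldsymbol{\phi}^\top(\boldsymbol{x}_*)\boldsymbol{m}_N$ coincides with the
predictions made with the MAP estimate $\boldsymbol{\theta}_{\text{MAP}}$.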
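As a small worked instance of (9.57c), the sketch below computes the posterior predictive mean and variance for a straight-line model with features $[1, x]$. The three training points, the prior $\mathcal{N}(\boldsymbol{0}, \boldsymbol{I})$, and the noise variance are assumptions chosen so that the numbers can be checked by hand.

```python
import numpy as np

def posterior_predictive(phi_star, mN, SN, sigma2):
    """Mean and variance of p(y_* | X, Y, x_*) as in (9.57c)."""
    mean = phi_star @ mN
    var = phi_star @ SN @ phi_star + sigma2
    return mean, var

# Assumed toy data with features [1, x]; prior N(0, I), noise variance 0.25.
x = np.array([-1.0, 0.0, 1.0])
y = np.array([-1.0, 0.0, 1.0])
Phi = np.column_stack([np.ones_like(x), x])
sigma2 = 0.25

SN = np.linalg.inv(np.eye(2) + Phi.T @ Phi / sigma2)   # (9.43b): diag(1/13, 1/9)
mN = SN @ (Phi.T @ y / sigma2)                         # (9.43c): [0, 8/9]

mean_near, var_near = posterior_predictive(np.array([1.0, 0.5]), mN, SN, sigma2)
mean_far, var_far = posterior_predictive(np.array([1.0, 5.0]), mN, SN, sigma2)
```

The predictive variance at $x_* = 5$, far from the training inputs, is much larger than at $x_* = 0.5$, and it never drops below the noise variance.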
Remark (Marginal Likelihood and Posterior Predictive Distribution). By
replacing the integral in (9.57a), the predictive distribution can be
equivalently written as the expectation
$\mathbb{E}_{\boldsymbol{\theta} \mid \mathcal{X}, \mathcal{Y}}[p(y_* \mid \boldsymbol{x}_*, \boldsymbol{\theta})]$,
where the expectation is taken with respect to the parameter posterior
$p(\boldsymbol{\theta} \mid \mathcal{X}, \mathcal{Y})$.
Writing the posterior predictive distribution in this way highlights a close
resemblance to the marginal likelihood (9.42). The key differences between
the marginal likelihood and the posterior predictive distribution are (i) the
marginal likelihood can be thought of as predicting the training targets
$\boldsymbol{y}$ and not the test targets $y_*$, and (ii) the marginal
likelihood averages with respect to the parameter prior and not the parameter
posterior. ♢


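Remark (Mean and Variance of Noise-Free Function Values). In many cases, we
are not interested in the predictive distribution
$p(y_* \mid \mathcal{X}, \mathcal{Y}, \boldsymbol{x}_*)$ of a (noisy)
observation $y_*$. Instead, we would like to obtain the distribution of the
(noise-free) function values
$f(\boldsymbol{x}_*) = \boldsymbol{\phi}^\top(\boldsymbol{x}_*)\boldsymbol{\theta}$.
We determine the corresponding moments by exploiting the properties of means
and variances, which yields

$$
\begin{aligned}
\mathbb{E}[f(\boldsymbol{x}_*) \mid \mathcal{X}, \mathcal{Y}]
&= \mathbb{E}_{\boldsymbol{\theta}}[\boldsymbol{\phi}^\top(\boldsymbol{x}_*)\boldsymbol{\theta} \mid \mathcal{X}, \mathcal{Y}]
= \boldsymbol{\phi}^\top(\boldsymbol{x}_*)\, \mathbb{E}_{\boldsymbol{\theta}}[\boldsymbol{\theta} \mid \mathcal{X}, \mathcal{Y}] \\
&= \boldsymbol{\phi}^\top(\boldsymbol{x}_*)\boldsymbol{m}_N = \boldsymbol{m}_N^\top \boldsymbol{\phi}(\boldsymbol{x}_*),
\end{aligned} \tag{9.58}
$$

$$
\begin{aligned}
\mathbb{V}_{\boldsymbol{\theta}}[f(\boldsymbol{x}_*) \mid \mathcal{X}, \mathcal{Y}]
&= \mathbb{V}_{\boldsymbol{\theta}}[\boldsymbol{\phi}^\top(\boldsymbol{x}_*)\boldsymbol{\theta} \mid \mathcal{X}, \mathcal{Y}] \\
&= \boldsymbol{\phi}^\top(\boldsymbol{x}_*)\, \mathbb{V}_{\boldsymbol{\theta}}[\boldsymbol{\theta} \mid \mathcal{X}, \mathcal{Y}]\, \boldsymbol{\phi}(\boldsymbol{x}_*).
\end{aligned} \tag{9.59}
$$

We see that the predictive mean is the same as the predictive mean for noisy
observations as the noise has mean 0, and the predictive variance only
differs by $\sigma^2$, which is the variance of the measurement noise: When we
predict noisy function values, we need to include $\sigma^2$ as a source of
uncertainty, but this term is not needed for noise-free predictions. Here,
the only remaining uncertainty stems from the parameter posterior. ♢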





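Remark (Distribution over Functions). The fact that we integrate out the
parameters $\boldsymbol{\theta}$ induces a distribution over functions: If we
sample $\boldsymbol{\theta}_i \sim p(\boldsymbol{\theta} \mid \mathcal{X}, \mathcal{Y})$
from the parameter posterior, we obtain a single function realization
$\boldsymbol{\theta}_i^\top \boldsymbol{\phi}(\cdot)$. The mean function,
i.e., the set of all expected function values
$\mathbb{E}_{\boldsymbol{\theta}}[f(\cdot) \mid \boldsymbol{\theta}, \mathcal{X}, \mathcal{Y}]$,
of this distribution over functions is
$\boldsymbol{m}_N^\top \boldsymbol{\phi}(\cdot)$. The (marginal) variance,
i.e., the variance of the function $f(\cdot)$, is given by
$\boldsymbol{\phi}^\top(\cdot)\boldsymbol{S}_N\boldsymbol{\phi}(\cdot)$. ♢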


Example 9.8 (Posterior over Functions) 

Let us revisit the Bayesian linear regression problem with polynomials 
of degree 5. We choose a parameter prior
$p(\boldsymbol{\theta}) = \mathcal{N}(\boldsymbol{0}, \tfrac{1}{4}\boldsymbol{I})$.
Figure 9.9 visualizes the prior over functions induced by the parameter prior
and sample functions from this prior.

Figure 9.10 shows the posterior over functions that we obtain via
Bayesian linear regression. The training dataset is shown in panel (a); 
panel (b) shows the posterior distribution over functions, including the 
functions we would obtain via maximum likelihood and MAP estimation. 
The function we obtain using the MAP estimate also corresponds to the 
posterior mean function in the Bayesian linear regression setting. Panel (c) 
shows some plausible realizations (samples) of functions under that pos- 
terior over functions. 
[Figure 9.10: (a) Training data. (b) Posterior over functions represented by
the marginal uncertainties (shaded) showing the 67% and 95% predictive
confidence bounds, the maximum likelihood estimate (MLE) and the MAP estimate
(MAP), the latter of which is identical to the posterior mean function.
(c) Samples from the posterior over functions, which are induced by the
samples from the parameter posterior.]


Figure 9.11 shows some posterior distributions over functions induced by the
parameter posterior. For different polynomial degrees $M$, the left panels
show the maximum likelihood function
$\boldsymbol{\theta}_{\text{ML}}^\top \boldsymbol{\phi}(\cdot)$, the MAP
function $\boldsymbol{\theta}_{\text{MAP}}^\top \boldsymbol{\phi}(\cdot)$
(which is identical to the posterior mean function), and the 67% and 95%
predictive confidence bounds obtained by Bayesian linear regression,
represented by the shaded areas.

The right panels show samples from the posterior over functions: Here, we
sampled parameters $\boldsymbol{\theta}_i$ from the parameter posterior and
computed the function
$\boldsymbol{\phi}^\top(\boldsymbol{x}_*)\boldsymbol{\theta}_i$, which is a
single realization of a function under the posterior distribution over
functions. For low-order polynomials, the parameter posterior does not allow
the parameters to vary much: The sampled functions are nearly identical. When
we make the model more flexible by adding more parameters (i.e., we end up
with a higher-order polynomial), these parameters are not sufficiently
constrained by the posterior, and the sampled functions can be easily
visually separated. We also see in the corresponding panels on the left how
the uncertainty increases, especially at the boundaries.

Although for a seventh-order polynomial the MAP estimate yields a reasonable
fit, the Bayesian linear regression model additionally tells us that the
posterior uncertainty is huge. This information can be critical when we use
these predictions in a decision-making system, where bad decisions can have
significant consequences (e.g., in reinforcement learning or robotics).

[Figure 9.11: (a) Posterior distribution for polynomials of degree M = 3
(left) and samples from the posterior over functions (right). (b) Posterior
distribution for polynomials of degree M = 5 (left) and samples from the
posterior over functions (right). (c) Posterior distribution for polynomials
of degree M = 7 (left) and samples from the posterior over functions (right).]

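9.3.5 Computing the Marginal Likelihood


In Section 8.6.2, we highlighted the importance of the marginal likelihood
for Bayesian model selection. In the following, we compute the marginal
likelihood for Bayesian linear regression with a conjugate Gaussian prior on
the parameters, i.e., exactly the setting we have been discussing in this
chapter.

Just to recap, we consider the following generative process:

$$
\begin{aligned}
\boldsymbol{\theta} &\sim \mathcal{N}(\boldsymbol{m}_0, \boldsymbol{S}_0) && \text{(9.60a)} \\
y_n \mid \boldsymbol{x}_n, \boldsymbol{\theta} &\sim \mathcal{N}(\boldsymbol{x}_n^\top \boldsymbol{\theta}, \sigma^2), && \text{(9.60b)}
\end{aligned}
$$

$n = 1, \dots, N$. The marginal likelihood is given by

$$
\begin{aligned}
p(\mathcal{Y} \mid \mathcal{X}) &= \int p(\mathcal{Y} \mid \mathcal{X}, \boldsymbol{\theta})\, p(\boldsymbol{\theta})\, \mathrm{d}\boldsymbol{\theta} && \text{(9.61a)} \\
&= \int \mathcal{N}(\boldsymbol{y} \mid \boldsymbol{X}\boldsymbol{\theta}, \sigma^2 \boldsymbol{I})\, \mathcal{N}(\boldsymbol{\theta} \mid \boldsymbol{m}_0, \boldsymbol{S}_0)\, \mathrm{d}\boldsymbol{\theta}, && \text{(9.61b)}
\end{aligned}
$$

where we integrate out the model parameters $\boldsymbol{\theta}$. We compute
the marginal likelihood in two steps: First, we show that the marginal
likelihood is Gaussian (as a distribution in $\boldsymbol{y}$); second, we
compute the mean and covariance of this Gaussian.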


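1. The marginal likelihood is Gaussian: From Section 6.5.2, we know that
   (i) the product of two Gaussian densities is an (unnormalized) Gaussian
   distribution, and (ii) a linear transformation of a Gaussian random
   variable is Gaussian distributed. In (9.61b), we require a linear
   transformation to bring
   $\mathcal{N}(\boldsymbol{y} \mid \boldsymbol{X}\boldsymbol{\theta}, \sigma^2 \boldsymbol{I})$
   into the form
   $\mathcal{N}(\boldsymbol{\theta} \mid \boldsymbol{\mu}, \boldsymbol{\Sigma})$
   for some $\boldsymbol{\mu}, \boldsymbol{\Sigma}$. Once this is done, the
   integral can be solved in closed form. The result is the normalizing
   constant of the product of the two Gaussians. The normalizing constant
   itself has Gaussian shape; see (6.76).

2. Mean and covariance. We compute the mean and covariance matrix of the
   marginal likelihood by exploiting the standard results for means and
   covariances of affine transformations of random variables; see
   Section 6.4.4. The mean of the marginal likelihood is computed as

$$
\mathbb{E}[\mathcal{Y} \mid \mathcal{X}] = \mathbb{E}_{\boldsymbol{\theta}, \boldsymbol{\epsilon}}[\boldsymbol{X}\boldsymbol{\theta} + \boldsymbol{\epsilon}] = \boldsymbol{X}\, \mathbb{E}_{\boldsymbol{\theta}}[\boldsymbol{\theta}] = \boldsymbol{X}\boldsymbol{m}_0. \tag{9.62}
$$

   Note that $\boldsymbol{\epsilon} \sim \mathcal{N}(\boldsymbol{0}, \sigma^2 \boldsymbol{I})$
   is a vector of i.i.d. random variables. The covariance matrix is given as

$$
\begin{aligned}
\operatorname{Cov}[\mathcal{Y} \mid \mathcal{X}] &= \operatorname{Cov}_{\boldsymbol{\theta}, \boldsymbol{\epsilon}}[\boldsymbol{X}\boldsymbol{\theta} + \boldsymbol{\epsilon}] = \operatorname{Cov}_{\boldsymbol{\theta}}[\boldsymbol{X}\boldsymbol{\theta}] + \sigma^2 \boldsymbol{I} && \text{(9.63a)} \\
&= \boldsymbol{X} \operatorname{Cov}_{\boldsymbol{\theta}}[\boldsymbol{\theta}] \boldsymbol{X}^\top + \sigma^2 \boldsymbol{I} = \boldsymbol{X}\boldsymbol{S}_0\boldsymbol{X}^\top + \sigma^2 \boldsymbol{I}. && \text{(9.63b)}
\end{aligned}
$$

Hence, the marginal likelihood is

$$
\begin{aligned}
p(\mathcal{Y} \mid \mathcal{X}) &= (2\pi)^{-\frac{N}{2}} \det(\boldsymbol{X}\boldsymbol{S}_0\boldsymbol{X}^\top + \sigma^2 \boldsymbol{I})^{-\frac{1}{2}} \\
&\quad \cdot \exp\Big( -\tfrac{1}{2} (\boldsymbol{y} - \boldsymbol{X}\boldsymbol{m}_0)^\top (\boldsymbol{X}\boldsymbol{S}_0\boldsymbol{X}^\top + \sigma^2 \boldsymbol{I})^{-1} (\boldsymbol{y} - \boldsymbol{X}\boldsymbol{m}_0) \Big) && \text{(9.64a)} \\
&= \mathcal{N}(\boldsymbol{y} \mid \boldsymbol{X}\boldsymbol{m}_0,\; \boldsymbol{X}\boldsymbol{S}_0\boldsymbol{X}^\top + \sigma^2 \boldsymbol{I}). && \text{(9.64b)}
\end{aligned}
$$

Given the close connection with the posterior predictive distribution (see
Remark on Marginal Likelihood and Posterior Predictive Distribution earlier
in this section), the functional form of the marginal likelihood should not
be too surprising.

[Figure 9.12 panels: (a) Regression dataset consisting of noisy observations
$y_n$ (blue) of function values $f(x_n)$ at input locations $x_n$. (b) The
orange dots are the projections of the noisy observations (blue dots) onto
the line $\theta_{\text{ML}} x$. The maximum likelihood solution to a linear
regression problem finds a subspace (line) onto which the overall projection
error (orange lines) of the observations is minimized.]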
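The Gaussian form (9.64b) of the marginal likelihood can be evaluated in log-space, which is usually preferable numerically. The function below is a sketch under assumptions: its name and the scalar sanity check are illustrative and not part of the text.

```python
import numpy as np

def log_marginal_likelihood(X, y, m0, S0, sigma2):
    """log p(Y | X) = log N(y | X m0, X S0 X^T + sigma2 * I), see (9.64b)."""
    N = X.shape[0]
    mean = X @ m0                                   # (9.62)
    cov = X @ S0 @ X.T + sigma2 * np.eye(N)         # (9.63b)
    diff = y - mean
    _, logdet = np.linalg.slogdet(cov)              # stable log-determinant
    quad = diff @ np.linalg.solve(cov, diff)
    return -0.5 * (N * np.log(2 * np.pi) + logdet + quad)

# Scalar sanity check: N = 1, X = [[2]], m0 = 0, S0 = 1, sigma2 = 1,
# so y is marginally N(0, 2 * 1 * 2 + 1) = N(0, 5).
lml = log_marginal_likelihood(np.array([[2.0]]), np.array([1.0]),
                              np.zeros(1), np.eye(1), 1.0)
expected = -0.5 * (np.log(2 * np.pi * 5.0) + 1.0 / 5.0)
```

For model selection one would evaluate this quantity for each candidate feature map and compare the values, as discussed in Section 8.6.2.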


9.4 Maximum Likelihood as Orthogonal Projection 


Having crunched through much algebra to derive maximum likelihood 
and MAP estimates, we will now provide a geometric interpretation of 
maximum likelihood estimation. Let us consider a simple linear regression 
setting 


    $y = x\theta + \epsilon, \quad \epsilon \sim \mathcal{N}(0, \sigma^2)$,    (9.65)


in which we consider linear functions $f : \mathbb{R} \to \mathbb{R}$ that go through the
origin (we omit features here for clarity). The parameter $\theta$ determines the
slope of the line. Figure 9.12(a) shows a one-dimensional dataset.

With a training data set $\{(x_1, y_1), \dots, (x_N, y_N)\}$ we recall the results
from Section 9.2.1 and obtain the maximum likelihood estimator for the
slope parameter as

    $\theta_{\mathrm{ML}} = (X^\top X)^{-1} X^\top y = \frac{X^\top y}{X^\top X} \in \mathbb{R}$,    (9.66)

where $X = [x_1, \dots, x_N]^\top \in \mathbb{R}^N$, $y = [y_1, \dots, y_N]^\top \in \mathbb{R}^N$.

This means for the training inputs $X$ we obtain the optimal (maximum
likelihood) reconstruction of the training targets as

    $X\theta_{\mathrm{ML}} = X\,\frac{X^\top y}{X^\top X} = \frac{X X^\top}{X^\top X}\,y$,    (9.67)

i.e., we obtain the approximation with the minimum least-squares error
between $y$ and $X\theta$.

Figure 9.12 Geometric interpretation of least squares. (a) Dataset; (b) maximum likelihood solution interpreted as a projection.

Linear regression can be thought of as a method for solving systems of linear equations.

Maximum likelihood linear regression performs an orthogonal projection.

As we are looking for a solution of $y = X\theta$, we can think of linear
regression as a problem for solving systems of linear equations. Therefore,
we can relate to concepts from linear algebra and analytic geometry
that we discussed in Chapters 2 and 3. In particular, looking carefully
at (9.67) we see that the maximum likelihood estimator $\theta_{\mathrm{ML}}$ in our
example from (9.65) effectively does an orthogonal projection of $y$ onto
the one-dimensional subspace spanned by $X$. Recalling the results on
orthogonal projections from Section 3.8, we identify $\frac{X X^\top}{X^\top X}$ as the projection
matrix, $\theta_{\mathrm{ML}}$ as the coordinates of the projection onto the one-dimensional
subspace of $\mathbb{R}^N$ spanned by $X$, and $X\theta_{\mathrm{ML}}$ as the orthogonal projection of
$y$ onto this subspace.

Therefore, the maximum likelihood solution also provides a geometrically
optimal solution by finding the vectors in the subspace spanned by
$X$ that are “closest” to the corresponding observations $y$, where “closest”
means the smallest (squared) distance of the function values $y_n$ to
$x_n\theta$. This is achieved by orthogonal projections. Figure 9.12(b) shows the
projection of the noisy observations onto the subspace that minimizes the 
squared distance between the original dataset and its projection (note that 
the x-coordinate is fixed), which corresponds to the maximum likelihood 
solution. 
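
The projection view of (9.66) and (9.67) can be checked on synthetic data. This is a small NumPy sketch under assumed values (the slope 0.8, the noise level, and the variable names are illustrative, not taken from the text). It confirms that the residual $y - X\theta_{\mathrm{ML}}$ is orthogonal to $X$, which is the defining property of an orthogonal projection.

```python
import numpy as np

rng = np.random.default_rng(1)
N = 20
x = rng.uniform(-3.0, 3.0, size=N)            # training inputs, stacked as X in R^N
y = 0.8 * x + rng.normal(scale=0.5, size=N)   # noisy targets (assumed slope 0.8)

theta_ml = (x @ y) / (x @ x)                  # slope estimate, cf. (9.66)
P = np.outer(x, x) / (x @ x)                  # projection matrix X X^T / (X^T X)
y_proj = P @ y                                # reconstruction, cf. (9.67)

residual = y - y_proj                         # error vector, orthogonal to x
```

Because `P` is a projection matrix, applying it twice changes nothing (`P @ P` equals `P`), and `y_proj` coincides with `x * theta_ml`.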

In the general linear regression case where

    $y = \phi^\top(x)\theta + \epsilon, \quad \epsilon \sim \mathcal{N}(0, \sigma^2)$    (9.68)

with vector-valued features $\phi(x) \in \mathbb{R}^K$, we again can interpret the
maximum likelihood result

    $y \approx \Phi\theta_{\mathrm{ML}}$,    (9.69)
    $\theta_{\mathrm{ML}} = (\Phi^\top\Phi)^{-1}\Phi^\top y$    (9.70)

as a projection onto a $K$-dimensional subspace of $\mathbb{R}^N$, which is spanned
by the columns of the feature matrix $\Phi$; see Section 3.8.2.

If the feature functions $\phi_k$ that we use to construct the feature matrix
$\Phi$ are orthonormal (see Section 3.7), we obtain a special case where
the columns of $\Phi$ form an orthonormal basis (see Section 3.5), such that
$\Phi^\top\Phi = I$. This will then lead to the projection

    $\Phi(\Phi^\top\Phi)^{-1}\Phi^\top y = \Phi\Phi^\top y = \Big(\sum_{k=1}^{K} \phi_k\phi_k^\top\Big) y$    (9.71)

so that the maximum likelihood projection is simply the sum of projections
of $y$ onto the individual basis vectors $\phi_k$, i.e., the columns of $\Phi$. Furthermore,
the coupling between different features has disappeared due to the
orthogonality of the basis. Many popular basis functions in signal processing,
such as wavelets and Fourier bases, are orthogonal basis functions.
When the basis is not orthogonal, one can convert a set of linearly independent
basis functions to an orthogonal basis by using the Gram-Schmidt
process; see Section 3.8.3 and (Strang, 2003).
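
The decoupling in (9.71) can be illustrated numerically. In the sketch below, the random feature matrix and all names are assumptions. We orthonormalize the columns with a reduced QR decomposition (`np.linalg.qr`), which spans the same subspace as Gram-Schmidt would but is a different algorithm. We then compare the general projection formula with the sum of rank-one projections.

```python
import numpy as np

rng = np.random.default_rng(2)
N, K = 8, 3
Phi_raw = rng.normal(size=(N, K))   # non-orthogonal feature matrix (assumed)
y = rng.normal(size=N)

# Orthonormal basis of the column space of Phi_raw (QR in place of Gram-Schmidt).
Q, _ = np.linalg.qr(Phi_raw)        # Q is N x K with Q^T Q = I

# General projection formula, left-hand side of (9.71).
proj_general = Q @ np.linalg.solve(Q.T @ Q, Q.T @ y)

# Sum of projections onto the individual basis vectors, right-hand side of (9.71).
proj_sum = sum((Q[:, k] @ y) * Q[:, k] for k in range(K))

# Projection computed with the original, non-orthogonal features, cf. (9.70).
proj_raw = Phi_raw @ np.linalg.solve(Phi_raw.T @ Phi_raw, Phi_raw.T @ y)
```

All three vectors agree: the projection depends only on the subspace, while the orthonormal basis makes each coordinate computable independently of the others.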


9.5 Further Reading 


In this chapter, we discussed linear regression for Gaussian likelihoods 
and conjugate Gaussian priors on the parameters of the model. This al- 
lowed for closed-form Bayesian inference. However, in some applications 
we may want to choose a different likelihood function. For example, in 
a binary classification setting, we observe only two possible (categorical) 
outcomes, and a Gaussian likelihood is inappropriate in this setting. In- 
stead, we can choose a Bernoulli likelihood that will return a probability of 
the predicted label to be 1 (or 0). We refer to the books by Barber (2012), 
Bishop (2006), and Murphy (2012) for an in-depth introduction to classifi- 
cation problems. A different example where non-Gaussian likelihoods are 
important is count data. Counts are non-negative integers, and in this case 
a Binomial or Poisson likelihood would be a better choice than a Gaussian. 
All these examples fall into the category of generalized linear models, a flex- 
ible generalization of linear regression that allows for response variables 
that have error distributions other than a Gaussian distribution. The GLM 
generalizes linear regression by allowing the linear model to be related
to the observed values via a smooth and invertible function $\sigma(\cdot)$ that may
be nonlinear so that $y = \sigma(f(x))$, where $f(x) = \theta^\top\phi(x)$ is the linear
regression model from (9.13). We can therefore think of a generalized
linear model in terms of function composition $y = \sigma \circ f$, where $f$ is a
linear regression model and $\sigma$ the activation function. Note that although
we are talking about “generalized linear models”, the outputs $y$ are no
longer linear in the parameters $\theta$. In logistic regression, we choose the
logistic sigmoid $\sigma(f) = \frac{1}{1 + \exp(-f)} \in [0, 1]$, which can be interpreted as the
probability of observing $y = 1$ of a Bernoulli random variable $y \in \{0, 1\}$.
The function $\sigma(\cdot)$ is called transfer function or activation function, and its
inverse is called the canonical link function. From this perspective, it is
also clear that generalized linear models are the building blocks of (deep)
feedforward neural networks: If we consider a generalized linear model
$y = \sigma(Ax + b)$, where $A$ is a weight matrix and $b$ a bias vector, we identify
this generalized linear model as a single-layer neural network with
activation function $\sigma(\cdot)$. We can now recursively compose these functions
via

    $x_{k+1} = f_k(x_k)$,
    $f_k(x_k) = \sigma_k(A_k x_k + b_k)$    (9.72)

for $k = 0, \dots, K - 1$, where $x_0$ are the input features and $x_K = y$ are
the observed outputs, such that $f_{K-1} \circ \cdots \circ f_0$ is a $K$-layer deep neural
network. Therefore, the building blocks of this deep neural network are
the generalized linear models defined in (9.72). Neural networks (Bishop,
1995; Goodfellow et al., 2016) are significantly more expressive and flexi- 
ble than linear regression models. However, maximum likelihood parame- 
ter estimation is a non-convex optimization problem, and marginalization 
of the parameters in a fully Bayesian setting is analytically intractable. 
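
The recursive composition in (9.72) can be sketched in a few lines. The code below is only an illustration: the helper names `glm_layer` and `compose`, the layer sizes, and the random weights are assumptions, and no training is performed. Each layer is a generalized linear model with the logistic sigmoid as activation function.

```python
import numpy as np

def logistic(f):
    """Logistic sigmoid, mapping real values into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-f))

def glm_layer(A, b, activation):
    """One generalized linear model x -> activation(A x + b), cf. (9.72)."""
    return lambda x: activation(A @ x + b)

def compose(layers):
    """Return f_{K-1} o ... o f_0 for layers = [f_0, ..., f_{K-1}]."""
    def network(x):
        for f in layers:
            x = f(x)
        return x
    return network

# Two-layer example with assumed sizes 4 -> 3 -> 1 and random weights.
rng = np.random.default_rng(3)
sizes = [4, 3, 1]
layers = [
    glm_layer(rng.normal(size=(n_out, n_in)), rng.normal(size=n_out), logistic)
    for n_in, n_out in zip(sizes[:-1], sizes[1:])
]
net = compose(layers)

x0 = rng.normal(size=4)   # input features
y = net(x0)               # output in (0, 1), interpretable as P(y = 1)
```

With a single layer and a scalar output, this reduces to the logistic regression model described above.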

We briefly hinted at the fact that a distribution over parameters in- 
duces a distribution over regression functions. Gaussian processes (Ras- 
mussen and Williams, 2006) are regression models where the concept of 
a distribution over functions is central. Instead of placing a distribution
over parameters, a Gaussian process places a distribution directly on the
space of functions without the “detour” via the parameters. To do so, the
Gaussian process exploits the kernel trick (Schölkopf and Smola, 2002),
which allows us to compute inner products between two function values
$f(x_i)$, $f(x_j)$ only by looking at the corresponding inputs $x_i$, $x_j$. A Gaussian
process is closely related to both Bayesian linear regression and support
vector regression but can also be interpreted as a Bayesian neural
network with a single hidden layer where the number of units tends to
infinity (Neal, 1996; Williams, 1997). Excellent introductions to Gaussian
processes can be found in MacKay (1998) and Rasmussen and Williams
(2006).

We focused on Gaussian parameter priors in the discussions in this chap- 
ter, because they allow for closed-form inference in linear regression mod- 
els. However, even in a regression setting with Gaussian likelihoods, we 
may choose a non-Gaussian prior. Consider a setting where the inputs are
$x \in \mathbb{R}^D$ and our training set is small and of size $N < D$. This means that
the regression problem is underdetermined. In this case, we can choose 
a parameter prior that enforces sparsity, i.e., a prior that tries to set as 
many parameters to 0 as possible (variable selection). This prior provides 
a stronger regularizer than the Gaussian prior, which often leads to an in- 
creased prediction accuracy and interpretability of the model. The Laplace 
prior is one example that is frequently used for this purpose. A linear re- 
gression model with the Laplace prior on the parameters is equivalent to 
linear regression with L1 regularization (LASSO) (Tibshirani, 1996). The 
Laplace distribution is sharply peaked at zero (its first derivative is discon- 
tinuous) and it concentrates its probability mass closer to zero than the 
Gaussian distribution, which encourages parameters to be 0. Therefore, 
the nonzero parameters are relevant for the regression problem, which is 
the reason why we also speak of “variable selection”. 
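
The sparsity effect of L1 regularization can be observed on a small underdetermined problem. The sketch below makes several assumptions that are not part of the text: it minimizes $\frac{1}{2}\|y - X\theta\|^2 + \lambda\|\theta\|_1$ with the iterative soft-thresholding algorithm (ISTA), one of several possible solvers; the function names, the synthetic data, and the choice of $\lambda$ are illustrative. For comparison, it also computes the ridge (Gaussian prior) solution with the same $\lambda$.

```python
import numpy as np

def soft_threshold(v, t):
    """Proximal operator of t * ||.||_1: shrink towards 0, clip at 0."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def lasso_ista(X, y, lam, n_iter=5000):
    """Minimize 0.5 * ||y - X theta||^2 + lam * ||theta||_1 with ISTA."""
    step = 1.0 / np.linalg.norm(X, 2) ** 2   # 1 / largest eigenvalue of X^T X
    theta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        grad = X.T @ (X @ theta - y)         # gradient of the squared-error term
        theta = soft_threshold(theta - step * grad, step * lam)
    return theta

def ridge(X, y, lam):
    """Closed-form minimizer of 0.5 * ||y - X theta||^2 + 0.5 * lam * ||theta||^2."""
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

# Underdetermined toy problem: N < D, only three relevant inputs (assumed).
rng = np.random.default_rng(8)
N, D = 10, 30
X = rng.normal(size=(N, D))
theta_true = np.zeros(D)
theta_true[[2, 11, 25]] = [3.0, -2.0, 4.0]
y = X @ theta_true + rng.normal(scale=0.1, size=N)

lam = 0.1 * np.max(np.abs(X.T @ y))          # illustrative regularization strength
theta_l1 = lasso_ista(X, y, lam)
theta_l2 = ridge(X, y, lam)

n_zero_l1 = int(np.sum(theta_l1 == 0.0))     # exact zeros under L1
n_zero_l2 = int(np.sum(theta_l2 == 0.0))     # exact zeros under L2
```

The soft-thresholding step sets coordinates exactly to 0, whereas the ridge solution shrinks all coordinates but generally leaves them nonzero.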
Dimensionality Reduction with Principal Component Analysis


Working directly with high-dimensional data, such as images, comes with 
some difficulties: It is hard to analyze, interpretation is difficult, visualiza- 
tion is nearly impossible, and (from a practical point of view) storage of 
the data vectors can be expensive. However, high-dimensional data often 
has properties that we can exploit. For example, high-dimensional data is 
often overcomplete, i.e., many dimensions are redundant and can be ex- 
plained by a combination of other dimensions. Furthermore, dimensions 
in high-dimensional data are often correlated so that the data possesses an 
intrinsic lower-dimensional structure. Dimensionality reduction exploits 
structure and correlation and allows us to work with a more compact rep- 
resentation of the data, ideally without losing information. We can think 
of dimensionality reduction as a compression technique, similar to jpeg or 
mp3, which are compression algorithms for images and music. 

In this chapter, we will discuss principal component analysis (PCA), an 
algorithm for linear dimensionality reduction. PCA, proposed by Pearson 
(1901) and Hotelling (1933), has been around for more than 100 years 
and is still one of the most commonly used techniques for data compres- 
sion and data visualization. It is also used for the identification of simple 
patterns, latent factors, and structures of high-dimensional data. In the
signal processing community, PCA is also known as the Karhunen-Loève
transform. In this chapter, we derive PCA from first principles, drawing on
our understanding of basis and basis change (Sections 2.6.1 and 2.7.2),
projections (Section 3.8), eigenvalues (Section 4.2), Gaussian distributions
(Section 6.5), and constrained optimization (Section 7.2).

Figure 10.1 Illustration: dimensionality reduction. (a) Dataset with $x_1$ and $x_2$ coordinates: the original dataset does not vary much along the $x_2$ direction. (b) Compressed dataset where only the $x_1$ coordinate is relevant: the data from (a) can be represented using the $x_1$-coordinate alone with nearly no loss.

Dimensionality reduction generally exploits a property of high-dimensional
data (e.g., images) that it often lies on a low-dimensional subspace.
Figure 10.1 gives an illustrative example in two dimensions. Although
the data in Figure 10.1(a) does not quite lie on a line, the data does not
vary much in the $x_2$-direction, so that we can express it as if it were on
a line with nearly no loss; see Figure 10.1(b). To describe the data in
Figure 10.1(b), only the $x_1$-coordinate is required, and the data lies in a
one-dimensional subspace of $\mathbb{R}^2$.


10.1 Problem Setting 


In PCA, we are interested in finding projections $\tilde{x}_n$ of data points $x_n$ that
are as similar to the original data points as possible, but which have a
significantly lower intrinsic dimensionality. Figure 10.1 gives an illustration
of what this could look like.

More concretely, we consider an i.i.d. dataset $\mathcal{X} = \{x_1, \dots, x_N\}$, $x_n \in \mathbb{R}^D$,
with mean 0 that possesses the data covariance matrix (6.42)

    $S = \frac{1}{N}\sum_{n=1}^{N} x_n x_n^\top$.    (10.1)

Furthermore, we assume there exists a low-dimensional compressed
representation (code)

    $z_n = B^\top x_n \in \mathbb{R}^M$    (10.2)

of $x_n$, where we define the projection matrix

    $B := [b_1, \dots, b_M] \in \mathbb{R}^{D \times M}$.    (10.3)

We assume that the columns of $B$ are orthonormal (Definition 3.7) so that
$b_i^\top b_j = 0$ if and only if $i \neq j$ and $b_i^\top b_i = 1$. We seek an $M$-dimensional
subspace $U \subseteq \mathbb{R}^D$, $\dim(U) = M < D$ onto which we project the data. We
denote the projected data by $\tilde{x}_n \in U$, and their coordinates (with respect
to the basis vectors $b_1, \dots, b_M$ of $U$) by $z_n$. Our aim is to find projections
$\tilde{x}_n \in \mathbb{R}^D$ (or equivalently the codes $z_n$ and the basis vectors $b_1, \dots, b_M$)
so that they are as similar to the original data $x_n$ as possible and minimize
the loss due to compression.


Example 10.1 (Coordinate Representation/Code)
Consider $\mathbb{R}^2$ with the canonical basis $e_1 = [1, 0]^\top$, $e_2 = [0, 1]^\top$. From
Chapter 2, we know that $x \in \mathbb{R}^2$ can be represented as a linear combination
of these basis vectors, e.g.,

    $[5, 3]^\top = 5e_1 + 3e_2$.    (10.4)

However, when we consider vectors of the form

    $\tilde{x} = [0, z]^\top \in \mathbb{R}^2, \quad z \in \mathbb{R}$,    (10.5)

they can always be written as $0e_1 + ze_2$. To represent these vectors it is
sufficient to remember/store the coordinate/code $z$ of $\tilde{x}$ with respect to
the $e_2$ vector.

More precisely, the set of $\tilde{x}$ vectors (with the standard vector addition
and scalar multiplication) forms a vector subspace $U$ (see Section 2.4)
with $\dim(U) = 1$ because $U = \mathrm{span}[e_2]$.

Figure 10.2 panel labels: Original, Compressed ($z$), Reconstructed.


In Section 10.2, we will find low-dimensional representations that retain
as much information as possible and minimize the compression loss.
An alternative derivation of PCA is given in Section 10.3, where we will
be looking at minimizing the squared reconstruction error $\|x_n - \tilde{x}_n\|^2$
between the original data $x_n$ and its projection $\tilde{x}_n$.

Figure 10.2 illustrates the setting we consider in PCA, where $z$ represents
the lower-dimensional representation of the compressed data $\tilde{x}$ and
plays the role of a bottleneck, which controls how much information can
flow between $x$ and $\tilde{x}$. In PCA, we consider a linear relationship between
the original data $x$ and its low-dimensional code $z$ so that $z = B^\top x$ and
$\tilde{x} = Bz$ for a suitable matrix $B$. Based on the motivation of thinking
of PCA as a data compression technique, we can interpret the arrows in
Figure 10.2 as a pair of operations representing encoders and decoders.
The linear mapping represented by $B$ can be thought of as a decoder,
which maps the low-dimensional code $z \in \mathbb{R}^M$ back into the original data
space $\mathbb{R}^D$. Similarly, $B^\top$ can be thought of as an encoder, which encodes the
original data $x$ as a low-dimensional (compressed) code $z$.
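
The encoder/decoder pair can be written down directly. In this sketch, the function names `encode` and `decode` are assumptions, and $B$ is an arbitrary matrix with orthonormal columns obtained from a QR decomposition. It is not yet the PCA solution, which is derived in the following sections.

```python
import numpy as np

rng = np.random.default_rng(4)
D, M = 5, 2

# Any D x M matrix with orthonormal columns serves as an example B.
B, _ = np.linalg.qr(rng.normal(size=(D, M)))

def encode(x, B):
    """Encoder: z = B^T x, the M-dimensional code of x."""
    return B.T @ x

def decode(z, B):
    """Decoder: x_tilde = B z, the reconstruction in the original space."""
    return B @ z

x = rng.normal(size=D)
z = encode(x, B)            # compressed code in R^M
x_tilde = decode(z, B)      # reconstruction in R^D, lies in span of B's columns
```

Encoding the reconstruction again returns the same code, so no further information is lost after the first pass through the bottleneck.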

Throughout this chapter, we will use the MNIST digits dataset as a
re-occurring example, which contains 60,000 examples of handwritten digits
0 through 9. Each digit is a grayscale image of size 28 x 28, i.e., it contains
784 pixels so that we can interpret every image in this dataset as a vector
$x \in \mathbb{R}^{784}$. Examples of these digits are shown in Figure 10.3.

The dimension of a vector space corresponds to the number of its basis vectors (see Section 2.6.1).


10.2 Maximum Variance Perspective 


Figure 10.1 gave an example of how a two-dimensional dataset can be
represented using a single coordinate. In Figure 10.1(b), we chose to
ignore the $x_2$-coordinate of the data because it did not add too much
information so that the compressed data is similar to the original data in
Figure 10.1(a). We could have chosen to ignore the $x_1$-coordinate, but
then the compressed data would have been very dissimilar from the original data,
and much information in the data would have been lost.

If we interpret information content in the data as how “space filling” 
the dataset is, then we can describe the information contained in the data 
by looking at the spread of the data. From Section 6.4.1, we know that the 
variance is an indicator of the spread of the data, and we can derive PCA as 
a dimensionality reduction algorithm that maximizes the variance in the 
low-dimensional representation of the data to retain as much information 
as possible. Figure 10.4 illustrates this. 

Considering the setting discussed in Section 10.1, our aim is to find
a matrix $B$ (see (10.3)) that retains as much information as possible
when compressing data by projecting it onto the subspace spanned by
the columns $b_1, \dots, b_M$ of $B$. Retaining most information after data
compression is equivalent to capturing the largest amount of variance in the
low-dimensional code (Hotelling, 1933).


Remark. (Centered Data) For the data covariance matrix in (10.1), we
assumed centered data. We can make this assumption without loss of
generality: Let us assume that $\mu$ is the mean of the data. Using the properties
of the variance, which we discussed in Section 6.4.4, we obtain

    $\mathbb{V}_z[z] = \mathbb{V}_x[B^\top(x - \mu)] = \mathbb{V}_x[B^\top x - B^\top\mu] = \mathbb{V}_x[B^\top x]$,    (10.6)

i.e., the variance of the low-dimensional code does not depend on the
mean of the data. Therefore, we assume without loss of generality that the
data has mean 0 for the remainder of this section. With this assumption
the mean of the low-dimensional code is also 0 since $\mathbb{E}_z[z] = \mathbb{E}_x[B^\top x] =
B^\top\mathbb{E}_x[x] = 0$. ♢
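
The invariance stated in (10.6) is easy to verify empirically. The following NumPy sketch uses assumed synthetic data with a nonzero mean and an arbitrary orthonormal $B$; the helper name `code_covariance` is hypothetical. It compares the covariance of the code with and without centering the data first.

```python
import numpy as np

rng = np.random.default_rng(5)
D, M, N = 4, 2, 500

# Data matrix with columns as data points and a deliberately nonzero mean.
scale = np.array([[3.0], [1.0], [0.5], [0.1]])
shift = np.array([[10.0], [-5.0], [2.0], [0.0]])
X = rng.normal(size=(D, N)) * scale + shift

B, _ = np.linalg.qr(rng.normal(size=(D, M)))   # arbitrary orthonormal columns
mu = X.mean(axis=1, keepdims=True)             # empirical mean of the data

def code_covariance(X, B):
    """Empirical covariance matrix of the codes z_n = B^T x_n."""
    Z = B.T @ X
    Zc = Z - Z.mean(axis=1, keepdims=True)
    return Zc @ Zc.T / X.shape[1]

cov_raw = code_covariance(X, B)                # codes of the uncentered data
cov_centered = code_covariance(X - mu, B)      # codes of the centered data
code_mean = (B.T @ (X - mu)).mean(axis=1)      # mean of the centered codes
```

Both covariance matrices coincide, and the code of the centered data has mean 0 up to floating-point error.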
10.2.1 Direction with Maximal Variance


We maximize the variance of the low-dimensional code using a sequential
approach. We start by seeking a single vector $b_1 \in \mathbb{R}^D$ that maximizes the
variance of the projected data, i.e., we aim to maximize the variance of
the first coordinate $z_1$ of $z \in \mathbb{R}^M$ so that

    $V_1 := \mathbb{V}[z_1] = \frac{1}{N}\sum_{n=1}^{N} z_{1n}^2$    (10.7)

is maximized, where we exploited the i.i.d. assumption of the data and
defined $z_{1n}$ as the first coordinate of the low-dimensional representation
$z_n \in \mathbb{R}^M$ of $x_n \in \mathbb{R}^D$. Note that the first component of $z_n$ is given by

    $z_{1n} = b_1^\top x_n$,    (10.8)

i.e., it is the coordinate of the orthogonal projection of $x_n$ onto the
one-dimensional subspace spanned by $b_1$ (Section 3.8). We substitute (10.8)
into (10.7), which yields

    $V_1 = \frac{1}{N}\sum_{n=1}^{N} (b_1^\top x_n)^2 = \frac{1}{N}\sum_{n=1}^{N} b_1^\top x_n x_n^\top b_1$    (10.9a)
    $= b_1^\top \Big(\frac{1}{N}\sum_{n=1}^{N} x_n x_n^\top\Big) b_1 = b_1^\top S b_1$,    (10.9b)

where $S$ is the data covariance matrix defined in (10.1). In (10.9a), we
have used the fact that the dot product of two vectors is symmetric with
respect to its arguments, that is, $b_1^\top x_n = x_n^\top b_1$.

Notice that arbitrarily increasing the magnitude of the vector $b_1$
increases $V_1$, that is, a vector $b_1$ that is two times longer can result in $V_1$
that is potentially four times larger. Therefore, we restrict all solutions to
$\|b_1\|^2 = 1$, which results in a constrained optimization problem in which
we seek the direction along which the data varies most.

With the restriction of the solution space to unit vectors the vector $b_1$
that points in the direction of maximum variance can be found by the
constrained optimization problem

    $\max_{b_1} \; b_1^\top S b_1$
    subject to $\|b_1\|^2 = 1$.    (10.10)

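Following Section 7.2, we obtain the Lagrangian

    $\mathfrak{L}(b_1, \lambda_1) = b_1^\top S b_1 + \lambda_1 (1 - b_1^\top b_1)$    (10.11)

to solve this constrained optimization problem. The partial derivatives of
$\mathfrak{L}$ with respect to $b_1$ and $\lambda_1$ are

    $\frac{\partial\mathfrak{L}}{\partial b_1} = 2 b_1^\top S - 2\lambda_1 b_1^\top, \qquad \frac{\partial\mathfrak{L}}{\partial\lambda_1} = 1 - b_1^\top b_1$,    (10.12)

respectively. Setting these partial derivatives to 0 gives us the relations

    $S b_1 = \lambda_1 b_1$,    (10.13)
    $b_1^\top b_1 = 1$.    (10.14)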


By comparing this with the definition of an eigenvalue decomposition
(Section 4.4), we see that $b_1$ is an eigenvector of the data covariance
matrix $S$, and the Lagrange multiplier $\lambda_1$ plays the role of the corresponding
eigenvalue. This eigenvector property (10.13) allows us to rewrite our
variance objective (10.10) as

    $V_1 = b_1^\top S b_1 = \lambda_1 b_1^\top b_1 = \lambda_1$,    (10.15)

i.e., the variance of the data projected onto a one-dimensional subspace
equals the eigenvalue that is associated with the basis vector $b_1$ that spans
this subspace. Therefore, to maximize the variance of the low-dimensional
code, we choose the basis vector associated with the largest eigenvalue
of the data covariance matrix. This eigenvector is called the first principal
component. We can determine the effect/contribution of the principal
component $b_1$ in the original data space by mapping the coordinate $z_{1n}$ back
into data space, which gives us the projected data point

    $\tilde{x}_n = b_1 z_{1n} = b_1 b_1^\top x_n \in \mathbb{R}^D$    (10.16)

in the original data space.


Remark. Although $\tilde{x}_n$ is a $D$-dimensional vector, it only requires a single
coordinate $z_{1n}$ to represent it with respect to the basis vector $b_1 \in \mathbb{R}^D$. ♢
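
The result (10.15) can be reproduced with a symmetric eigensolver. The sketch below is an illustration on assumed synthetic data (the mixing matrix `A` and all variable names are hypothetical). It uses `np.linalg.eigh`, which returns eigenvalues in ascending order, so the first principal component is the last column of the eigenvector matrix.

```python
import numpy as np

rng = np.random.default_rng(6)
D, N = 3, 1000

# Correlated synthetic data, columns are data points; then center it.
A = np.array([[2.0, 0.0, 0.0],
              [0.5, 1.0, 0.0],
              [0.0, 0.3, 0.2]])
X = A @ rng.normal(size=(D, N))
X = X - X.mean(axis=1, keepdims=True)

S = X @ X.T / N                          # data covariance matrix (10.1)
eigvals, eigvecs = np.linalg.eigh(S)     # ascending eigenvalues of symmetric S
lambda1 = eigvals[-1]                    # largest eigenvalue
b1 = eigvecs[:, -1]                      # first principal component (unit norm)

z1 = b1 @ X                              # coordinates z_{1n} = b1^T x_n, (10.8)
V1 = np.mean(z1 ** 2)                    # variance of the projection, (10.7)
X_tilde = np.outer(b1, z1)               # projected points b1 z_{1n}, (10.16)
```

Here `V1` matches `lambda1`, and every column of `X_tilde` is a multiple of `b1`, so the projected data needs only one coordinate per point.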


10.2.2 M-dimensional Subspace with Maximal Variance 


Assume we have found the first $m - 1$ principal components as the $m - 1$
eigenvectors of $S$ that are associated with the largest $m - 1$ eigenvalues.
Since $S$ is symmetric, the spectral theorem (Theorem 4.15) states that we
can use these eigenvectors to construct an orthonormal eigenbasis of an
$(m - 1)$-dimensional subspace of $\mathbb{R}^D$. Generally, the $m$th principal
component can be found by subtracting the effect of the first $m - 1$ principal
components $b_1, \dots, b_{m-1}$ from the data, thereby trying to find principal
components that compress the remaining information. We then arrive at
the new data matrix

    $\hat{X} := X - \sum_{i=1}^{m-1} b_i b_i^\top X = X - B_{m-1} X$,    (10.17)

where $X = [x_1, \dots, x_N] \in \mathbb{R}^{D \times N}$ contains the data points as column
vectors and $B_{m-1} := \sum_{i=1}^{m-1} b_i b_i^\top$ is a projection matrix that projects onto
the subspace spanned by $b_1, \dots, b_{m-1}$.


Remark (Notation). Throughout this chapter, we do not follow the convention
of collecting data $x_1, \dots, x_N$ as the rows of the data matrix, but
we define them to be the columns of $X$. This means that our data matrix
$X$ is a $D \times N$ matrix instead of the conventional $N \times D$ matrix. The
reason for our choice is that the algebra operations work out smoothly
without the need to either transpose the matrix or to redefine vectors as
row vectors that are left-multiplied onto matrices. ♢


To find the $m$th principal component, we maximize the variance

    $V_m = \mathbb{V}[z_m] = \frac{1}{N}\sum_{n=1}^{N} z_{mn}^2 = \frac{1}{N}\sum_{n=1}^{N} (b_m^\top \hat{x}_n)^2 = b_m^\top \hat{S} b_m$,    (10.18)

subject to $\|b_m\|^2 = 1$, where we followed the same steps as in (10.9b)
and defined $\hat{S}$ as the data covariance matrix of the transformed dataset
$\hat{\mathcal{X}} := \{\hat{x}_1, \dots, \hat{x}_N\}$. As previously, when we looked at the first principal
component alone, we solve a constrained optimization problem and discover
that the optimal solution $b_m$ is the eigenvector of $\hat{S}$ that is associated
with the largest eigenvalue of $\hat{S}$.

It turns out that $b_m$ is also an eigenvector of $S$. More generally, the sets
of eigenvectors of $S$ and $\hat{S}$ are identical. Since both $S$ and $\hat{S}$ are symmetric,
we can find an ONB of eigenvectors (spectral theorem 4.15), i.e.,
there exist $D$ distinct eigenvectors for both $S$ and $\hat{S}$. Next, we show that
every eigenvector of $S$ is an eigenvector of $\hat{S}$. Assume we have already
found eigenvectors $b_1, \dots, b_{m-1}$ of $S$. Consider an eigenvector $b_i$ of $S$,
i.e., $S b_i = \lambda_i b_i$. In general,

    $\hat{S} b_i = \frac{1}{N}\hat{X}\hat{X}^\top b_i = \frac{1}{N}(X - B_{m-1} X)(X - B_{m-1} X)^\top b_i$    (10.19a)
    $= (S - S B_{m-1} - B_{m-1} S + B_{m-1} S B_{m-1})\, b_i$.    (10.19b)


We distinguish between two cases. If $i \geq m$, i.e., $b_i$ is an eigenvector
that is not among the first $m - 1$ principal components, then $b_i$ is orthogonal
to the first $m - 1$ principal components and $B_{m-1} b_i = 0$. If $i < m$, i.e.,
$b_i$ is among the first $m - 1$ principal components, then $b_i$ is a basis vector
of the principal subspace onto which $B_{m-1}$ projects. Since $b_1, \dots, b_{m-1}$
are an ONB of this principal subspace, we obtain $B_{m-1} b_i = b_i$. The two
cases can be summarized as follows:

    $B_{m-1} b_i = b_i$ if $i < m$, \qquad $B_{m-1} b_i = 0$ if $i \geq m$.    (10.20)

In the case $i \geq m$, by using (10.20) in (10.19b), we obtain $\hat{S} b_i = (S -
B_{m-1} S) b_i = S b_i = \lambda_i b_i$, i.e., $b_i$ is also an eigenvector of $\hat{S}$ with
eigenvalue $\lambda_i$. Specifically,

    $\hat{S} b_m = S b_m = \lambda_m b_m$.    (10.21)

Equation (10.21) reveals that $b_m$ is not only an eigenvector of $S$ but also
of $\hat{S}$. Specifically, $\lambda_m$ is the largest eigenvalue of $\hat{S}$ and $\lambda_m$ is the $m$th
largest eigenvalue of $S$, and both have the associated eigenvector $b_m$.

In the case $i < m$, by using (10.20) in (10.19b), we obtain

    $\hat{S} b_i = (S - S B_{m-1} - B_{m-1} S + B_{m-1} S B_{m-1})\, b_i = 0 = 0\, b_i$.    (10.22)

This means that $b_1, \dots, b_{m-1}$ are also eigenvectors of $\hat{S}$, but they are
associated with eigenvalue 0 so that $b_1, \dots, b_{m-1}$ span the null space of $\hat{S}$.
Overall, every eigenvector of $S$ is also an eigenvector of $\hat{S}$. However,
if the eigenvectors of $S$ are part of the $(m - 1)$-dimensional principal
subspace, then the associated eigenvalue of $\hat{S}$ is 0.
With the relation (10.21) and $b_m^\top b_m = 1$, the variance of the data
projected onto the $m$th principal component is

    $V_m = b_m^\top \hat{S} b_m \overset{(10.21)}{=} \lambda_m b_m^\top b_m = \lambda_m$.    (10.23)

This means that the variance of the data, when projected onto an
$M$-dimensional subspace, equals the sum of the eigenvalues that are associated
with the corresponding eigenvectors of the data covariance matrix.

... maximal variance and the eigenvalue decomposition. We will revisit this connection in Section 10.4.

Figure 10.5 Properties of the training data of MNIST “8”. (a) Eigenvalues sorted in descending order; (b) Variance captured by the principal components associated with the largest eigenvalues.
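
The relations (10.21) and (10.22) can be checked by carrying out the deflation step (10.17) explicitly. The sketch below uses assumed synthetic data and $m = 3$; the variable names are illustrative. After removing the first $m - 1$ principal components from the data, the largest eigenvalue of $\hat{S}$ is the $m$th largest eigenvalue of $S$, and the removed directions lie in the null space of $\hat{S}$.

```python
import numpy as np

rng = np.random.default_rng(7)
D, N, m = 4, 2000, 3

# Synthetic centered data with clearly separated variances per direction.
X = np.diag([3.0, 2.0, 1.0, 0.5]) @ rng.normal(size=(D, N))
X = X - X.mean(axis=1, keepdims=True)
S = X @ X.T / N

# Eigenpairs of S sorted by descending eigenvalue.
eigvals, eigvecs = np.linalg.eigh(S)
order = np.argsort(eigvals)[::-1]
lam, B_full = eigvals[order], eigvecs[:, order]

B_prev = B_full[:, : m - 1]       # first m-1 principal components
P = B_prev @ B_prev.T             # projection matrix B_{m-1}
X_hat = X - P @ X                 # deflated data matrix, (10.17)
S_hat = X_hat @ X_hat.T / N       # covariance of the deflated data

b_m = B_full[:, m - 1]            # m-th principal component of S
top_eig_S_hat = np.linalg.eigvalsh(S_hat)[-1]   # largest eigenvalue of S_hat
```

In this example, `S_hat @ b_m` equals `lam[m - 1] * b_m`, and `S_hat @ B_prev` vanishes up to floating-point error.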


Example 10.2 (Eigenvalues of MNIST “8”)
Taking all digits “8” in the MNIST training data, we compute the eigenvalues
of the data covariance matrix. Figure 10.5(a) shows the 200 largest
eigenvalues of the data covariance matrix. We see that only a few of
them have a value that differs significantly from 0. Therefore, most of
the variance, when projecting data onto the subspace spanned by the
corresponding eigenvectors, is captured by only a few principal components,
as shown in Figure 10.5(b).

Figure 10.5 panels: (a) Eigenvalues (sorted in descending order) of the data covariance matrix of all digits “8” in the MNIST training set. (b) Variance captured by the principal components.


Overall, to find an $M$-dimensional subspace of $\mathbb{R}^D$ that retains as much
information as possible, PCA tells us to choose the columns of the matrix
$B$ in (10.3) as the $M$ eigenvectors of the data covariance matrix $S$ that
are associated with the $M$ largest eigenvalues. The maximum amount of
variance PCA can capture with the first $M$ principal components is

    $V_M = \sum_{m=1}^{M} \lambda_m$,    (10.24)

where the $\lambda_m$ are the $M$ largest eigenvalues of the data covariance matrix
$S$. Consequently, the variance lost by data compression via PCA is

    $J_M := \sum_{j=M+1}^{D} \lambda_j = V_D - V_M$.    (10.25)

Instead of these absolute quantities, we can define the relative variance
captured as $\frac{V_M}{V_D}$, and the relative variance lost by compression as $1 - \frac{V_M}{V_D}$.
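
The quantities in (10.24) and (10.25) are simple sums over the sorted eigenvalues. The helper below is a minimal sketch; the function name `variance_summary` and the example eigenvalues are assumptions chosen so that the arithmetic can be checked by hand.

```python
import numpy as np

def variance_summary(eigvals, M):
    """Return (V_M, J_M, V_M / V_D) for the M largest eigenvalues."""
    lam = np.sort(np.asarray(eigvals, dtype=float))[::-1]  # descending order
    V_M = lam[:M].sum()      # variance captured, (10.24)
    J_M = lam[M:].sum()      # variance lost, (10.25)
    V_D = lam.sum()          # total variance
    return V_M, J_M, V_M / V_D

# Eigenvalues 4, 3, 0.5, 0.5 with M = 2:
# V_M = 4 + 3 = 7, J_M = 0.5 + 0.5 = 1, relative variance captured = 7/8.
V_M, J_M, ratio = variance_summary([4.0, 0.5, 3.0, 0.5], M=2)
```

In practice, one might increase $M$ until the relative variance captured exceeds a chosen threshold; the threshold itself is a modeling decision and not determined by PCA.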


10.3 Projection Perspective 


In the following, we will derive PCA as an algorithm that directly mini- 
mizes the average reconstruction error. This perspective allows us to in- 
terpret PCA as implementing an optimal linear auto-encoder. We will draw 
heavily from Chapters 2 and 3. 

In the previous section, we derived PCA by maximizing the variance
in the projected space to retain as much information as possible. In the
following, we will look at the difference vectors between the original data
$x_n$ and their reconstruction $\tilde{x}_n$ and minimize this distance so that $x_n$ and
$\tilde{x}_n$ are as close as possible. Figure 10.6 illustrates this setting.

Figure 10.6 Illustration of the projection approach: Find a subspace (line) that minimizes the length of the difference vector between projected (orange) and original (blue) data.

Figure 10.7 Simplified projection setting. (a) A vector $x \in \mathbb{R}^2$ (red cross) shall be projected onto a one-dimensional subspace $U \subseteq \mathbb{R}^2$ spanned by $b$. (b) shows the difference vectors between $x$ and some candidates $\tilde{x}$: differences $x - \tilde{x}_i$ for 50 different $\tilde{x}_i$ are shown by the red lines.

Vectors $\tilde{x} \in U$ could be vectors on a plane in $\mathbb{R}^3$. The dimensionality of the plane is 2, but the vectors still have three coordinates with respect to the standard basis of $\mathbb{R}^3$.


10.3.1 Setting and Objective 


Assume an (ordered) orthonormal basis (ONB) $B = (b_1, \dots, b_D)$ of $\mathbb{R}^D$,
i.e., $b_i^\top b_j = 1$ if and only if $i = j$ and 0 otherwise.

From Section 2.5 we know that for a basis $(b_1, \dots, b_D)$ of $\mathbb{R}^D$ any $x \in
\mathbb{R}^D$ can be written as a linear combination of the basis vectors of $\mathbb{R}^D$, i.e.,

    $x = \sum_{d=1}^{D} \zeta_d b_d = \sum_{m=1}^{M} \zeta_m b_m + \sum_{j=M+1}^{D} \zeta_j b_j$    (10.26)

for suitable coordinates $\zeta_d \in \mathbb{R}$.
We are interested in finding vectors $\tilde{x} \in \mathbb{R}^D$, which live in a lower-dimensional
subspace $U \subseteq \mathbb{R}^D$, $\dim(U) = M$, so that

    $\tilde{x} = \sum_{m=1}^{M} z_m b_m \in U \subseteq \mathbb{R}^D$    (10.27)

is as similar to $x$ as possible. Note that at this point we need to assume
that the coordinates $z_m$ of $\tilde{x}$ and $\zeta_m$ of $x$ are not identical.

In the following, we use exactly this kind of representation of $\tilde{x}$ to find
optimal coordinates $z$ and basis vectors $b_1, \dots, b_M$ such that $\tilde{x}$ is as similar
to the original data point $x$ as possible, i.e., we aim to minimize the
(Euclidean) distance $\|x - \tilde{x}\|$. Figure 10.7 illustrates this setting.

Without loss of generality, we assume that the dataset $\mathcal{X} = \{x_1, \dots, x_N\}$,
$x_n \in \mathbb{R}^D$, is centered at 0, i.e., $\mathbb{E}[\mathcal{X}] = 0$. Without the zero-mean assumption,
we would arrive at exactly the same solution, but the notation would
be substantially more cluttered.
We are interested in finding the best linear projection of $\mathcal{X}$ onto a lower-dimensional
subspace $U$ of $\mathbb{R}^D$ with $\dim(U) = M$ and orthonormal basis
vectors $b_1, \dots, b_M$. We will call this subspace $U$ the principal subspace.
The projections of the data points are denoted by

    $\tilde{x}_n := \sum_{m=1}^{M} z_{mn} b_m = B z_n \in \mathbb{R}^D$,    (10.28)

where $z_n := [z_{1n}, \dots, z_{Mn}]^\top \in \mathbb{R}^M$ is the coordinate vector of $\tilde{x}_n$ with
respect to the basis $(b_1, \dots, b_M)$. More specifically, we are interested in
having the $\tilde{x}_n$ as similar to $x_n$ as possible.
The similarity measure we use in the following is the squared distance
(Euclidean norm) $\|x - \tilde{x}\|^2$ between $x$ and $\tilde{x}$. We therefore define our
objective as minimizing the average squared Euclidean distance (reconstruction
error) (Pearson, 1901)

    $J_M := \frac{1}{N}\sum_{n=1}^{N} \|x_n - \tilde{x}_n\|^2$,    (10.29)

where we make it explicit that the dimension of the subspace onto which
we project the data is $M$. In order to find this optimal linear projection,
we need to find the orthonormal basis of the principal subspace and the
coordinates $z_n \in \mathbb{R}^M$ of the projections with respect to this basis.

To find the coordinates $z_n$ and the ONB of the principal subspace, we
follow a two-step approach. First, we optimize the coordinates $z_n$ for a
given ONB $(b_1, \dots, b_M)$; second, we find the optimal ONB.


10.3.2 Finding Optimal Coordinates 


Let us start by finding the optimal coordinates $z_{1n}, \dots, z_{Mn}$ of the projections
$\tilde{x}_n$ for $n = 1, \dots, N$. Consider Figure 10.7(b), where the principal
subspace is spanned by a single vector $b$. Geometrically speaking, finding
the optimal coordinates $z$ corresponds to finding the representation of the
linear projection $\tilde{x}$ with respect to $b$ that minimizes the distance between
$\tilde{x}$ and $x$. From Figure 10.7(b), it is clear that this will be the orthogonal
projection, and in the following we will show exactly this.

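We assume an ONB $(b_1, \dots, b_M)$ of $U \subseteq \mathbb{R}^D$. To find the optimal coordinates
$z_{mn}$ with respect to this basis, we require the partial derivatives

$$\frac{\partial J_M}{\partial z_{in}} = \frac{\partial J_M}{\partial \tilde{x}_n} \frac{\partial \tilde{x}_n}{\partial z_{in}} , \tag{10.30a}$$
$$\frac{\partial J_M}{\partial \tilde{x}_n} = -\frac{2}{N} (x_n - \tilde{x}_n)^\top \in \mathbb{R}^{1 \times D} , \tag{10.30b}$$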
Figure 10.8: Optimal projection of a vector $x \in \mathbb{R}^2$ onto a one-dimensional
subspace (continuation from Figure 10.7).
(a) Distances $\|x - \tilde{x}\|$ for some $\tilde{x} = z_1 b \in U = \operatorname{span}[b]$; see panel (b) for the setting.
(b) The vector $\tilde{x}$ that minimizes the distance in panel (a) is its orthogonal projection onto
$U$. The coordinate of the projection $\tilde{x}$ with respect to the basis vector $b$ that spans $U$
is the factor we need to scale $b$ in order to “reach” $\tilde{x}$.

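$$\frac{\partial \tilde{x}_n}{\partial z_{in}} \overset{(10.28)}{=} \frac{\partial}{\partial z_{in}} \left( \sum_{m=1}^{M} z_{mn} b_m \right) = b_i \tag{10.30c}$$

for $i = 1, \dots, M$, such that we obtain

$$\frac{\partial J_M}{\partial z_{in}} = -\frac{2}{N} (x_n - \tilde{x}_n)^\top b_i \overset{(10.28)}{=} -\frac{2}{N} \left( x_n - \sum_{m=1}^{M} z_{mn} b_m \right)^\top b_i \tag{10.31a}$$
$$\overset{\text{ONB}}{=} -\frac{2}{N} \left( x_n^\top b_i - z_{in} b_i^\top b_i \right) = -\frac{2}{N} \left( x_n^\top b_i - z_{in} \right) , \tag{10.31b}$$

since $b_i^\top b_i = 1$. Setting this partial derivative to 0 yields immediately the
optimal coordinates

$$z_{in} = x_n^\top b_i = b_i^\top x_n \tag{10.32}$$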


for $i = 1, \dots, M$ and $n = 1, \dots, N$. This means that the optimal coordinates
$z_{in}$ of the projection $\tilde{x}_n$ are the coordinates of the orthogonal
projection (see Section 3.8) of the original data point $x_n$ onto the one-dimensional
subspace that is spanned by $b_i$. Consequently:


= The optimal linear projection $\tilde{x}_n$ of $x_n$ is an orthogonal projection.

= The coordinates of $\tilde{x}_n$ with respect to the basis $(b_1, \dots, b_M)$ are the
coordinates of the orthogonal projection of $x_n$ onto the principal subspace.

= An orthogonal projection is the best linear mapping given the objective (10.29).

= The coordinates $\zeta_m$ of $x$ in (10.26) and the coordinates $z_m$ of $\tilde{x}$ in (10.27)
must be identical for $m = 1, \dots, M$ since $U^\perp = \operatorname{span}[b_{M+1}, \dots, b_D]$ is
the orthogonal complement (see Section 3.6) of $U = \operatorname{span}[b_1, \dots, b_M]$.
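
The result (10.32) can be checked numerically. The following is a minimal sketch, assuming NumPy is available; the random basis, the dimensions, and all variable names are illustrative choices and not part of the text. It compares the coordinates $B^\top x$ with a generic least-squares solution of $\min_z \|x - Bz\|^2$ and confirms that the residual is orthogonal to the subspace.

```python
import numpy as np

rng = np.random.default_rng(0)
D, M = 5, 2  # illustrative dimensions (assumption)

# Hypothetical orthonormal basis: QR of a random matrix gives orthonormal columns.
Q, _ = np.linalg.qr(rng.standard_normal((D, D)))
B = Q[:, :M]                    # ONB of an M-dimensional subspace U
x = rng.standard_normal(D)      # one data point

z_opt = B.T @ x                 # optimal coordinates, cf. (10.32)
x_tilde = B @ z_opt             # orthogonal projection of x onto U

# Generic least-squares solution of min_z ||x - B z||^2 for comparison.
z_lstsq, *_ = np.linalg.lstsq(B, x, rcond=None)

# Displacement vector; it should be orthogonal to every basis vector of U.
residual = x - x_tilde
print(np.allclose(z_opt, z_lstsq), np.allclose(B.T @ residual, 0.0))
```

Because the basis is orthonormal, no linear system has to be solved: the least-squares solution reduces to the inner products $b_i^\top x$.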


Remark (Orthogonal Projections with Orthonormal Basis Vectors). Let us
briefly recap orthogonal projections from Section 3.8. If $(b_1, \dots, b_D)$ is an
orthonormal basis of $\mathbb{R}^D$ then

$$\tilde{x} = b_j (b_j^\top b_j)^{-1} b_j^\top x = b_j b_j^\top x \in \mathbb{R}^D \tag{10.33}$$

is the orthogonal projection of $x$ onto the subspace spanned by the $j$th basis
vector, and $z_j = b_j^\top x$ is the coordinate of this projection with respect to
the basis vector $b_j$ that spans that subspace since $z_j b_j = \tilde{x}$. Figure 10.8(b)
illustrates this setting.

More generally, if we aim to project onto an $M$-dimensional subspace
of $\mathbb{R}^D$, we obtain the orthogonal projection of $x$ onto the $M$-dimensional
subspace with orthonormal basis vectors $b_1, \dots, b_M$ as

$$\tilde{x} = B (\underbrace{B^\top B}_{= I})^{-1} B^\top x = B B^\top x , \tag{10.34}$$

where we defined $B := [b_1, \dots, b_M] \in \mathbb{R}^{D \times M}$. The coordinates of this
projection with respect to the ordered basis $(b_1, \dots, b_M)$ are $z := B^\top x$
as discussed in Section 3.8.
We can think of the coordinates as a representation of the projected
vector in a new coordinate system defined by $(b_1, \dots, b_M)$. Note that although
$\tilde{x} \in \mathbb{R}^D$, we only need $M$ coordinates $z_1, \dots, z_M$ to represent
this vector; the other $D - M$ coordinates with respect to the basis vectors
$(b_{M+1}, \dots, b_D)$ are always 0. ♢


So far we have shown that for a given ONB we can find the optimal 
coordinates of x by an orthogonal projection onto the principal subspace. 
In the following, we will determine what the best basis is. 


10.3.3 Finding the Basis of the Principal Subspace 


To determine the basis vectors $b_1, \dots, b_M$ of the principal subspace, we
rephrase the loss function (10.29) using the results we have so far. This
will make it easier to find the basis vectors. To reformulate the loss function,
we exploit our results from before and obtain

$$\tilde{x}_n = \sum_{m=1}^{M} z_{mn} b_m \overset{(10.32)}{=} \sum_{m=1}^{M} (x_n^\top b_m) b_m . \tag{10.35}$$

We now exploit the symmetry of the dot product, which yields

$$\tilde{x}_n = \left( \sum_{m=1}^{M} b_m b_m^\top \right) x_n . \tag{10.36}$$

Figure 10.9: Orthogonal projection and displacement vectors. When projecting data
points $x_n$ (blue) onto subspace $U_1$, we obtain $\tilde{x}_n$ (orange). The displacement vector
$\tilde{x}_n - x_n$ lies completely in the orthogonal complement $U_2$ of $U_1$.

Since we can generally write the original data point $x_n$ as a linear combination
of all basis vectors, it holds that

$$x_n = \sum_{d=1}^{D} z_{dn} b_d \overset{(10.32)}{=} \sum_{d=1}^{D} (x_n^\top b_d) b_d = \left( \sum_{d=1}^{D} b_d b_d^\top \right) x_n \tag{10.37a}$$
$$= \left( \sum_{m=1}^{M} b_m b_m^\top \right) x_n + \left( \sum_{j=M+1}^{D} b_j b_j^\top \right) x_n , \tag{10.37b}$$

where we split the sum with $D$ terms into a sum over $M$ and a sum
over $D - M$ terms. With this result, we find that the displacement vector
$x_n - \tilde{x}_n$, i.e., the difference vector between the original data point and its
projection, is

$$x_n - \tilde{x}_n = \left( \sum_{j=M+1}^{D} b_j b_j^\top \right) x_n \tag{10.38a}$$
$$= \sum_{j=M+1}^{D} (x_n^\top b_j) b_j . \tag{10.38b}$$

This means the difference is exactly the projection of the data point onto
the orthogonal complement of the principal subspace: We identify the matrix
$\sum_{j=M+1}^{D} b_j b_j^\top$ in (10.38a) as the projection matrix that performs this
projection. Hence the displacement vector $x_n - \tilde{x}_n$ lies in the subspace
that is orthogonal to the principal subspace as illustrated in Figure 10.9.


Remark (Low-Rank Approximation). In (10.36), we saw that the projection
matrix, which projects $x$ onto $\tilde{x}$, is given by

$$\sum_{m=1}^{M} b_m b_m^\top = B B^\top . \tag{10.39}$$

By construction as a sum of rank-one matrices $b_m b_m^\top$ we see that $B B^\top$ is
symmetric and has rank $M$. Therefore, the average squared reconstruction
error can also be written as

$$\frac{1}{N} \sum_{n=1}^{N} \|x_n - \tilde{x}_n\|^2 = \frac{1}{N} \sum_{n=1}^{N} \|x_n - B B^\top x_n\|^2 \tag{10.40a}$$
$$= \frac{1}{N} \sum_{n=1}^{N} \|(I - B B^\top) x_n\|^2 . \tag{10.40b}$$

Finding orthonormal basis vectors $b_1, \dots, b_M$, which minimize the difference
between the original data $x_n$ and their projections $\tilde{x}_n$, is equivalent
to finding the best rank-$M$ approximation $B B^\top$ of the identity matrix $I$
(see Section 4.6). ♢
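Now we have all the tools to reformulate the loss function (10.29).

$$J_M = \frac{1}{N} \sum_{n=1}^{N} \|x_n - \tilde{x}_n\|^2 \overset{(10.38b)}{=} \frac{1}{N} \sum_{n=1}^{N} \left\| \sum_{j=M+1}^{D} (b_j^\top x_n) b_j \right\|^2 . \tag{10.41}$$
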


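We now explicitly compute the squared norm and exploit the fact that the
$b_j$ form an ONB, which yields

$$J_M = \frac{1}{N} \sum_{n=1}^{N} \sum_{j=M+1}^{D} (b_j^\top x_n)^2 = \frac{1}{N} \sum_{n=1}^{N} \sum_{j=M+1}^{D} b_j^\top x_n b_j^\top x_n \tag{10.42a}$$
$$= \frac{1}{N} \sum_{n=1}^{N} \sum_{j=M+1}^{D} b_j^\top x_n x_n^\top b_j , \tag{10.42b}$$

where we exploited the symmetry of the dot product in the last step to
write $b_j^\top x_n = x_n^\top b_j$. We now swap the sums and obtain

$$J_M = \sum_{j=M+1}^{D} b_j^\top \underbrace{\left( \frac{1}{N} \sum_{n=1}^{N} x_n x_n^\top \right)}_{=: S} b_j = \sum_{j=M+1}^{D} b_j^\top S b_j \tag{10.43a}$$
$$= \sum_{j=M+1}^{D} \operatorname{tr}(b_j^\top S b_j) = \sum_{j=M+1}^{D} \operatorname{tr}(S b_j b_j^\top) = \operatorname{tr}\Bigg( \underbrace{\Bigg( \sum_{j=M+1}^{D} b_j b_j^\top \Bigg)}_{\text{projection matrix}} S \Bigg) , \tag{10.43b}$$
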
where we exploited the property that the trace operator $\operatorname{tr}(\cdot)$ (see (4.18))
is linear and invariant to cyclic permutations of its arguments. Since we
assumed that our dataset is centered, i.e., $\mathbb{E}[\mathcal{X}] = 0$, we identify $S$ as the
data covariance matrix. Since the projection matrix in (10.43b) is constructed
as a sum of rank-one matrices $b_j b_j^\top$ it itself is of rank $D - M$.
Equation (10.43a) implies that we can formulate the average squared
reconstruction error equivalently as the covariance matrix of the data,
projected onto the orthogonal complement of the principal subspace. Minimizing
the average squared reconstruction error is therefore equivalent to
minimizing the variance of the data when projected onto the subspace we
ignore, i.e., the orthogonal complement of the principal subspace. Equivalently,
we maximize the variance of the projection that we retain in the
principal subspace, which links the projection loss immediately to the
maximum-variance formulation of PCA discussed in Section 10.2. But this
then also means that we will obtain the same solution that we obtained
for the maximum-variance perspective. Therefore, we omit a derivation
that is identical to the one presented in Section 10.2 and summarize the
results from earlier in the light of the projection perspective.

Figure 10.10: Embedding of MNIST digits 0 (blue) and 1 (orange) in a two-dimensional
principal subspace using PCA. Four embeddings of the digits “0” and “1” in the
principal subspace are highlighted in red with their corresponding original digit.

The average squared reconstruction error, when projecting onto the $M$-dimensional
principal subspace, is

$$J_M = \sum_{j=M+1}^{D} \lambda_j , \tag{10.44}$$

where $\lambda_j$ are the eigenvalues of the data covariance matrix. Therefore,
to minimize (10.44) we need to select the smallest $D - M$ eigenvalues,
which then implies that their corresponding eigenvectors are the basis of
the orthogonal complement of the principal subspace. Consequently, this
means that the basis of the principal subspace comprises the eigenvectors
$b_1, \dots, b_M$ that are associated with the largest $M$ eigenvalues of the data
covariance matrix.
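
The chain (10.29), (10.43b), (10.44) can be verified on synthetic data. The sketch below assumes NumPy and uses a made-up dataset; the variable names (`J_M`, `J_trace`, `J_eig`) are illustrative. It computes the average squared reconstruction error directly, as the trace expression in (10.43b), and as the sum of the discarded eigenvalues.

```python
import numpy as np

rng = np.random.default_rng(1)
D, N, M = 6, 200, 2  # illustrative sizes (assumption)

# Synthetic D x N data matrix with different scales per dimension, then centered.
X = rng.standard_normal((D, N)) * np.arange(1, D + 1)[:, None]
X = X - X.mean(axis=1, keepdims=True)

S = X @ X.T / N                          # data covariance matrix
eigvals, eigvecs = np.linalg.eigh(S)     # eigh returns ascending eigenvalues
order = np.argsort(eigvals)[::-1]        # reorder to descending
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

B = eigvecs[:, :M]                       # basis of the principal subspace
X_tilde = B @ (B.T @ X)                  # projections of all data points

J_M = np.mean(np.sum((X - X_tilde) ** 2, axis=0))   # definition (10.29)
J_trace = np.trace((np.eye(D) - B @ B.T) @ S)       # trace form (10.43b)
J_eig = eigvals[M:].sum()                           # eigenvalue form (10.44)
print(J_M, J_trace, J_eig)
```

All three numbers should agree up to floating-point error, which illustrates why choosing the eigenvectors with the largest eigenvalues minimizes the reconstruction error.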


Example 10.3 (MNIST Digits Embedding)

Figure 10.10 visualizes the training data of the MNIST digits “0” and “1”
embedded in the vector subspace spanned by the first two principal components.
We observe a relatively clear separation between “0”s (blue dots)
and “1”s (orange dots), and we see the variation within each individual
cluster. Four embeddings of the digits “0” and “1” in the principal subspace
are highlighted in red with their corresponding original digit. The figure
reveals that the variation within the set of “0” is significantly greater than
the variation within the set of “1”.


10.4 Eigenvector Computation and Low-Rank Approximations 


In the previous sections, we obtained the basis of the principal subspace
as the eigenvectors that are associated with the largest eigenvalues of the
data covariance matrix

$$S = \frac{1}{N} \sum_{n=1}^{N} x_n x_n^\top = \frac{1}{N} X X^\top , \tag{10.45}$$
$$X = [x_1, \dots, x_N] \in \mathbb{R}^{D \times N} . \tag{10.46}$$

Note that $X$ is a $D \times N$ matrix, i.e., it is the transpose of the “typical”
data matrix (Bishop, 2006; Murphy, 2012). To get the eigenvalues (and
the corresponding eigenvectors) of $S$, we can follow two approaches:


= We perform an eigendecomposition (see Section 4.2) and compute the
eigenvalues and eigenvectors of $S$ directly.

= We use a singular value decomposition (see Section 4.5). Since $S$ is
symmetric and factorizes into $X X^\top$ (ignoring the factor $\frac{1}{N}$), the eigenvalues
of $S$ are the squared singular values of $X$.


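More specifically, the SVD of $X$ is given by

$$\underbrace{X}_{D \times N} = \underbrace{U}_{D \times D} \underbrace{\Sigma}_{D \times N} \underbrace{V^\top}_{N \times N} , \tag{10.47}$$

where $U \in \mathbb{R}^{D \times D}$ and $V^\top \in \mathbb{R}^{N \times N}$ are orthogonal matrices and $\Sigma \in \mathbb{R}^{D \times N}$
is a matrix whose only nonzero entries are the singular values $\sigma_{ii} \geq 0$.
It then follows that

$$S = \frac{1}{N} X X^\top = \frac{1}{N} U \Sigma \underbrace{V^\top V}_{= I_N} \Sigma^\top U^\top = \frac{1}{N} U \Sigma \Sigma^\top U^\top . \tag{10.48}$$

With the results from Section 4.5, we get that the columns of $U$ are the
eigenvectors of $X X^\top$ (and therefore $S$). Furthermore, the eigenvalues
$\lambda_d$ of $S$ are related to the singular values of $X$ via

$$\lambda_d = \frac{\sigma_d^2}{N} . \tag{10.49}$$

This relationship between the eigenvalues of $S$ and the singular values
of $X$ provides the connection between the maximum variance view (Section 10.2)
and the singular value decomposition.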
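
As a small numerical illustration of (10.48) and (10.49), the following sketch (assuming NumPy; the data are random and the names are illustrative) compares the eigenvalues of $S$ with the scaled squared singular values of $X$ and checks that the columns of $U$ are eigenvectors of $S$.

```python
import numpy as np

rng = np.random.default_rng(2)
D, N = 4, 50  # illustrative sizes (assumption)
X = rng.standard_normal((D, N))
X = X - X.mean(axis=1, keepdims=True)    # centered D x N data matrix

S = X @ X.T / N
eigvals = np.linalg.eigvalsh(S)[::-1]    # eigenvalues of S, descending

# Thin SVD; NumPy returns the singular values in descending order.
U, sing, Vt = np.linalg.svd(X, full_matrices=False)
eig_from_svd = sing**2 / N               # relation (10.49)

print(np.allclose(eigvals, eig_from_svd))
print(np.allclose(S @ U, U * eig_from_svd))  # S u_d = lambda_d u_d, column by column
```

In practice the SVD route avoids forming $X X^\top$ explicitly, which is often preferable numerically.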
10.4.1 PCA Using Low-Rank Matrix Approximations


To maximize the variance of the projected data (or minimize the average
squared reconstruction error), PCA chooses the columns of $U$ in (10.48)
to be the eigenvectors that are associated with the $M$ largest eigenvalues
of the data covariance matrix $S$ so that we identify $U$ as the projection matrix
$B$ in (10.3), which projects the original data onto a lower-dimensional
subspace of dimension $M$. The Eckart-Young theorem (Theorem 4.25 in
Section 4.6) offers a direct way to estimate the low-dimensional representation.
Consider the best rank-$M$ approximation

$$\tilde{X}_M := \operatorname{argmin}_{\operatorname{rk}(A) \leq M} \|X - A\|_2 \in \mathbb{R}^{D \times N} \tag{10.50}$$

of $X$, where $\|\cdot\|_2$ is the spectral norm defined in (4.93). The Eckart-Young
theorem states that $\tilde{X}_M$ is given by truncating the SVD at the top-$M$
singular values. In other words, we obtain

$$\tilde{X}_M = \underbrace{U_M}_{D \times M} \underbrace{\Sigma_M}_{M \times M} \underbrace{V_M^\top}_{M \times N} \in \mathbb{R}^{D \times N} \tag{10.51}$$

with matrices $U_M := [u_1, \dots, u_M] \in \mathbb{R}^{D \times M}$ and $V_M := [v_1, \dots, v_M] \in \mathbb{R}^{N \times M}$
that have orthonormal columns and a diagonal matrix $\Sigma_M \in \mathbb{R}^{M \times M}$ whose diagonal
entries are the $M$ largest singular values of $X$.
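
A minimal sketch of the truncated SVD in (10.51), assuming NumPy and a random matrix as stand-in data (all names are illustrative). It also checks a consequence of the Eckart-Young theorem: the spectral-norm error of the rank-$M$ approximation equals the first discarded singular value.

```python
import numpy as np

rng = np.random.default_rng(3)
D, N, M = 5, 40, 2  # illustrative sizes (assumption)
X = rng.standard_normal((D, N))

U, sing, Vt = np.linalg.svd(X, full_matrices=False)

# Keep only the M largest singular values and their singular vectors, cf. (10.51).
X_M = U[:, :M] @ np.diag(sing[:M]) @ Vt[:M, :]

# Spectral norm (largest singular value) of the approximation error.
spectral_error = np.linalg.norm(X - X_M, ord=2)
print(np.linalg.matrix_rank(X_M), spectral_error, sing[M])
```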


10.4.2 Practical Aspects 


Finding eigenvalues and eigenvectors is also important in other fundamental
machine learning methods that require matrix decompositions. In
theory, as we discussed in Section 4.2, we can solve for the eigenvalues as
roots of the characteristic polynomial. However, for matrices larger than
$4 \times 4$ this is not possible because we would need to find the roots of a polynomial
of degree 5 or higher, and the Abel-Ruffini theorem (Ruffini,
1799; Abel, 1826) states that there exists no general algebraic solution to this
problem for polynomials of degree 5 or more. Therefore, in practice, we
solve for eigenvalues or singular values using iterative methods, which are
implemented in all modern packages for linear algebra.

In many applications (such as PCA presented in this chapter), we only
require a few eigenvectors. It would be wasteful to compute the full decomposition,
and then discard all eigenvectors with eigenvalues that are
beyond the first few. It turns out that if we are interested in only the first
few eigenvectors (with the largest eigenvalues), then iterative processes,
which directly optimize these eigenvectors, are computationally more efficient
than a full eigendecomposition (or SVD). In the extreme case of only
needing the first eigenvector, a simple method called the power iteration
is very efficient. Power iteration chooses a random vector $x_0$ that is not in
the null space of $S$ and follows the iteration

$$x_{k+1} = \frac{S x_k}{\|S x_k\|} , \quad k = 0, 1, \dots . \tag{10.52}$$

This means the vector $x_k$ is multiplied by $S$ in every iteration and then
normalized, i.e., we always have $\|x_k\| = 1$. This sequence of vectors converges
to the eigenvector associated with the largest eigenvalue of $S$. The
original Google PageRank algorithm (Page et al., 1999) uses such an algorithm
for ranking web pages based on their hyperlinks.
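
The iteration (10.52) can be sketched in a few lines. The code below assumes NumPy; the function name `power_iteration`, the stopping tolerance, the iteration cap, and the test matrix with hand-picked eigenvalues are all illustrative assumptions rather than part of the text.

```python
import numpy as np

def power_iteration(S, num_iters=1000, tol=1e-12, seed=0):
    """Approximate the dominant eigenpair of a symmetric PSD matrix S."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal(S.shape[0])   # random start; almost surely not in the null space
    x /= np.linalg.norm(x)
    for _ in range(num_iters):
        y = S @ x                          # multiply by S ...
        y /= np.linalg.norm(y)             # ... and normalize, cf. (10.52)
        converged = np.linalg.norm(y - x) < tol
        x = y
        if converged:
            break
    eigenvalue = x @ S @ x                 # Rayleigh quotient for the unit vector x
    return eigenvalue, x

# Hypothetical symmetric test matrix with known eigenvalues 5, 2, 1, 0.5, 0.1.
rng = np.random.default_rng(4)
Q, _ = np.linalg.qr(rng.standard_normal((5, 5)))
S = Q @ np.diag([5.0, 2.0, 1.0, 0.5, 0.1]) @ Q.T
S = 0.5 * (S + S.T)                        # remove tiny asymmetries from rounding

lam, v = power_iteration(S)
print(lam)
```

The speed of convergence depends on the ratio of the two largest eigenvalues; here it is 2/5, so a few dozen iterations suffice.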


10.5 PCA in High Dimensions 


In order to do PCA, we need to compute the data covariance matrix. In $D$
dimensions, the data covariance matrix is a $D \times D$ matrix. Computing the
eigenvalues and eigenvectors of this matrix is computationally expensive
as it scales cubically in $D$. Therefore, PCA, as we discussed earlier, will be
infeasible in very high dimensions. For example, if our $x_n$ are images with
10,000 pixels (e.g., $100 \times 100$ pixel images), we would need to compute
the eigendecomposition of a $10{,}000 \times 10{,}000$ covariance matrix. In the
following, we provide a solution to this problem for the case that we have
substantially fewer data points than dimensions, i.e., $N \ll D$.

Assume we have a centered dataset $x_1, \dots, x_N$, $x_n \in \mathbb{R}^D$. Then the
data covariance matrix is given as

$$S = \frac{1}{N} X X^\top \in \mathbb{R}^{D \times D} , \tag{10.53}$$

where $X = [x_1, \dots, x_N]$ is a $D \times N$ matrix whose columns are the data
points.

We now assume that $N \ll D$, i.e., the number of data points is smaller
than the dimensionality of the data. If there are no duplicate data points,
the rank of the covariance matrix $S$ is $N$, so it has $D - N$ eigenvalues
that are 0. Intuitively, this means that there are some redundancies.
In the following, we will exploit this and turn the $D \times D$ covariance matrix
into an $N \times N$ covariance matrix whose eigenvalues are all positive.

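In PCA, we ended up with the eigenvector equation

$$S b_m = \lambda_m b_m , \quad m = 1, \dots, M , \tag{10.54}$$

where $b_m$ is a basis vector of the principal subspace. Let us rewrite this
equation a bit: With $S$ defined in (10.53), we obtain

$$S b_m = \frac{1}{N} X X^\top b_m = \lambda_m b_m . \tag{10.55}$$

We now multiply $X^\top \in \mathbb{R}^{N \times D}$ from the left-hand side, which yields

$$\frac{1}{N} \underbrace{X^\top X}_{N \times N} \underbrace{X^\top b_m}_{=: c_m} = \lambda_m X^\top b_m \iff \frac{1}{N} X^\top X c_m = \lambda_m c_m , \tag{10.56}$$
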

©2021 M. P. Deisenroth, A. A. Faisal, C. S. Ong. Published by Cambridge University Press (2020). 


(Margin note to the power iteration in Section 10.4.2: If $S$ is invertible, it
is sufficient to ensure that $x_0 \neq 0$.)

and we get a new eigenvector/eigenvalue equation: $\lambda_m$ remains an eigenvalue,
which confirms our results from Section 4.5.3 that the nonzero
eigenvalues of $X X^\top$ equal the nonzero eigenvalues of $X^\top X$. We obtain
the eigenvector of the matrix $\frac{1}{N} X^\top X \in \mathbb{R}^{N \times N}$ associated with $\lambda_m$ as
$c_m := X^\top b_m$. Assuming we have no duplicate data points, this matrix
has rank $N$ and is invertible. This also implies that $\frac{1}{N} X^\top X$ has the same
(nonzero) eigenvalues as the data covariance matrix $S$. But this is now an
$N \times N$ matrix, so that we can compute the eigenvalues and eigenvectors
much more efficiently than for the original $D \times D$ data covariance matrix.

Now that we have the eigenvectors of $\frac{1}{N} X^\top X$, we are going to recover
the original eigenvectors, which we still need for PCA. Currently,
we know the eigenvectors of $\frac{1}{N} X^\top X$. If we left-multiply our eigenvalue/eigenvector
equation with $X$, we get

$$\underbrace{\frac{1}{N} X X^\top}_{S} X c_m = \lambda_m X c_m \tag{10.57}$$

and we recover the data covariance matrix again. This now also means
that we recover $X c_m$ as an eigenvector of $S$.


Remark. If we want to apply the PCA algorithm that we discussed in Section 10.6,
we need to normalize the eigenvectors $X c_m$ of $S$ so that they
have norm 1. ♢
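
The procedure of this section can be sketched as follows, assuming NumPy and random stand-in data with $N \ll D$; the variable names (`K` for $\frac{1}{N} X^\top X$, `C` for the matrix of eigenvectors $c_m$) are illustrative. The sketch solves the small $N \times N$ eigenproblem, maps the eigenvectors back via $X c_m$ as in (10.57), and normalizes them as required by the remark.

```python
import numpy as np

rng = np.random.default_rng(5)
D, N, M = 500, 20, 3  # illustrative sizes with N << D (assumption)
X = rng.standard_normal((D, N))
X = X - X.mean(axis=1, keepdims=True)    # centered D x N data matrix

K = X.T @ X / N                          # N x N matrix (1/N) X^T X
lam, C = np.linalg.eigh(K)               # ascending eigenvalues
top = np.argsort(lam)[::-1][:M]          # indices of the M largest eigenvalues
lam, C = lam[top], C[:, top]

B = X @ C                                # columns X c_m are eigenvectors of S, cf. (10.57)
B /= np.linalg.norm(B, axis=0)           # normalize each eigenvector to unit length

# Check against the D x D covariance matrix (formed here only for verification).
S = X @ X.T / N
print(np.allclose(S @ B, B * lam), np.allclose(B.T @ B, np.eye(M)))
```

Only an $N \times N$ eigendecomposition is needed; the $D \times D$ matrix $S$ appears above purely to confirm the result.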


10.6 Key Steps of PCA in Practice 


In the following, we will go through the individual steps of PCA using a 
running example, which is summarized in Figure 10.11. We are given a 
two-dimensional dataset (Figure 10.11(a)), and we want to use PCA to 
project it onto a one-dimensional subspace. 


1. Mean subtraction We start by centering the data by computing the
mean $\mu$ of the dataset and subtracting it from every single data point.
This ensures that the dataset has mean 0 (Figure 10.11(b)). Mean subtraction
is not strictly necessary but reduces the risk of numerical problems.

2. Standardization Divide the data points by the standard deviation $\sigma_d$
of the dataset for every dimension $d = 1, \dots, D$. Now the data is unit
free, and it has variance 1 along each axis, which is indicated by the
two arrows in Figure 10.11(c). This step completes the standardization
of the data.

3. Eigendecomposition of the covariance matrix Compute the data
covariance matrix and its eigenvalues and corresponding eigenvectors.
Since the covariance matrix is symmetric, the spectral theorem (Theorem 4.15)
states that we can find an ONB of eigenvectors. In Figure 10.11(d),
the eigenvectors are scaled by the magnitude of the corresponding
eigenvalue. The longer vector spans the principal subspace,
which we denote by $U$. The data covariance matrix is represented by
the ellipse.

Draft (2021-01-14) of “Mathematics for Machine Learning”. Feedback: https://mml-book.com.

Figure 10.11, panels: (a) Original dataset. (b) Step 1: Centering by subtracting
the mean from each data point. (c) Step 2: Dividing by the standard deviation to make
the data unit free. Data has variance 1 along each axis. (d) Step 3: Compute eigenvalues
and eigenvectors (arrows) of the data covariance matrix (ellipse). (e) Step 4: Project
data onto the principal subspace. (f) Undo the standardization and move projected data
back into the original data space from (a).

4. Projection We can project any data point $x_* \in \mathbb{R}^D$ onto the principal
subspace: To get this right, we need to standardize $x_*$ using the mean
$\mu_d$ and standard deviation $\sigma_d$ of the training data in the $d$th dimension,
respectively, so that

$$x_*^{(d)} \leftarrow \frac{x_*^{(d)} - \mu_d}{\sigma_d} , \quad d = 1, \dots, D , \tag{10.58}$$

where $x_*^{(d)}$ is the $d$th component of $x_*$. We obtain the projection as

$$\tilde{x}_* = B B^\top x_* \tag{10.59}$$

with coordinates

$$z_* = B^\top x_* \tag{10.60}$$

with respect to the basis of the principal subspace. Here, $B$ is the matrix
that contains the eigenvectors that are associated with the largest
eigenvalues of the data covariance matrix as columns. PCA returns the
coordinates (10.60), not the projections $\tilde{x}_*$.

Figure 10.11: Steps of PCA. (a) Original dataset; (b) centering; (c) divide by
standard deviation; (d) eigendecomposition; (e) projection; (f) mapping back to
original data space.

Figure 10.12: Effect of increasing the number of principal components on
reconstruction.

Having standardized our dataset, (10.59) only yields the projections in
the context of the standardized dataset. To obtain our projection in the
original data space (i.e., before standardization), we need to undo the
standardization (10.58) and multiply by the standard deviation before
adding the mean so that we obtain

$$\tilde{x}_*^{(d)} \leftarrow \tilde{x}_*^{(d)} \sigma_d + \mu_d , \quad d = 1, \dots, D . \tag{10.61}$$

Figure 10.11(f) illustrates the projection in the original data space.
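
The four steps and the back-transformation (10.61) can be collected into a short sketch. It assumes NumPy and a made-up two-dimensional dataset; the function names `pca_fit` and `pca_project` are hypothetical helpers, not an established API. Data points are stored as columns, matching the $D \times N$ convention of (10.46).

```python
import numpy as np

def pca_fit(X, M):
    """Steps 1-3 for a D x N data matrix X (columns are data points)."""
    mu = X.mean(axis=1, keepdims=True)           # step 1: mean per dimension
    sigma = X.std(axis=1, keepdims=True)         # step 2: standard deviation per dimension
    Xs = (X - mu) / sigma                        # standardized data
    S = Xs @ Xs.T / X.shape[1]                   # step 3: data covariance matrix
    eigvals, eigvecs = np.linalg.eigh(S)
    B = eigvecs[:, np.argsort(eigvals)[::-1][:M]]  # eigenvectors of the M largest eigenvalues
    return mu, sigma, B

def pca_project(X_new, mu, sigma, B):
    """Step 4 for a D x K matrix of points, plus mapping back to data space."""
    Xs = (X_new - mu) / sigma                    # standardize, cf. (10.58)
    Z = B.T @ Xs                                 # coordinates, cf. (10.60)
    Xs_tilde = B @ Z                             # projection, cf. (10.59)
    X_tilde = Xs_tilde * sigma + mu              # undo standardization, cf. (10.61)
    return Z, X_tilde

# Hypothetical correlated 2D dataset, projected onto a one-dimensional subspace.
rng = np.random.default_rng(6)
latent = rng.standard_normal(300)
X = np.vstack([2.0 * latent + 0.3 * rng.standard_normal(300) + 5.0,
               1.0 * latent + 0.3 * rng.standard_normal(300) - 1.0])
mu, sigma, B = pca_fit(X, M=1)
Z, X_tilde = pca_project(X, mu, sigma, B)
print(Z.shape, X_tilde.shape)
```

Note that new data points must be standardized with the training mean and standard deviation, which is why `pca_fit` returns them alongside `B`.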


Example 10.4 (MNIST Digits: Reconstruction) 

In the following, we will apply PCA to the MNIST digits dataset, which
contains 60,000 examples of handwritten digits 0 through 9. Each digit is
an image of size $28 \times 28$, i.e., it contains 784 pixels so that we can interpret
every image in this dataset as a vector $x \in \mathbb{R}^{784}$. Examples of these digits
are shown in Figure 10.3.




For illustration purposes, we apply PCA to a subset of the MNIST digits, 
and we focus on the digit “8”. We used 5,389 training images of the digit 
“8” and determined the principal subspace as detailed in this chapter. We 
then used the learned projection matrix to reconstruct a set of test im- 
ages, which is illustrated in Figure 10.12. The first row of Figure 10.12 
shows a set of four original digits from the test set. The following rows 
show reconstructions of exactly these digits when using a principal sub- 
space of dimensions 1, 10, 100, and 500, respectively. We see that even 
with a single-dimensional principal subspace we get a halfway decent re- 
construction of the original digits, which, however, is blurry and generic. 
With an increasing number of principal components (PCs), the reconstructions
become sharper and more details are accounted for. With 500 principal
components, we effectively obtain a near-perfect reconstruction. If
we were to choose 784 PCs, we would recover the exact digit without any
compression loss.

Figure 10.13 shows the average squared reconstruction error, which is

$$\frac{1}{N} \sum_{n=1}^{N} \|x_n - \tilde{x}_n\|^2 = \sum_{i=M+1}^{D} \lambda_i , \tag{10.62}$$

as a function of the number $M$ of principal components. We can see that
the importance of the principal components drops off rapidly, and only
marginal gains can be achieved by adding more PCs. This matches exactly
our observation in Figure 10.5, where we discovered that most of the
variance of the projected data is captured by only a few principal components.
With about 550 PCs, we can essentially fully reconstruct the training
data that contains the digit “8” (some pixels around the boundaries show
no variation across the dataset as they are always black).














Figure 10.13: Average squared reconstruction error as a function of the number of
principal components (horizontal axis: number of PCs). The average squared
reconstruction error is the sum of the eigenvalues in the orthogonal complement of the
principal subspace.


10.7 Latent Variable Perspective


In the previous sections, we derived PCA without any notion of a probabilistic
model using the maximum-variance and the projection perspectives.
On the one hand, this approach may be appealing as it allows us to
sidestep all the mathematical difficulties that come with probability theory,
but on the other hand, a probabilistic model would offer us more flexibility
and useful insights. More specifically, a probabilistic model would


= Come with a likelihood function, and we can explicitly deal with noisy
observations (which we did not even discuss earlier)

= Allow us to do Bayesian model comparison via the marginal likelihood
as discussed in Section 8.6

= View PCA as a generative model, which allows us to simulate new data

= Allow us to make straightforward connections to related algorithms

= Deal with data dimensions that are missing at random by applying
Bayes’ theorem

= Give us a notion of the novelty of a new data point

= Give us a principled way to extend the model, e.g., to a mixture of PCA
models

= Have the PCA we derived in earlier sections as a special case

= Allow for a fully Bayesian treatment by marginalizing out the model
parameters


By introducing a continuous-valued latent variable $z \in \mathbb{R}^M$ it is possible
to phrase PCA as a probabilistic latent-variable model. Tipping and Bishop
(1999) proposed this latent-variable model as probabilistic PCA (PPCA).
PPCA addresses most of the aforementioned issues, and the PCA solution
that we obtained by maximizing the variance in the projected space or
by minimizing the reconstruction error is obtained as the special case of
maximum likelihood estimation in a noise-free setting.


10.7.1 Generative Process and Probabilistic Model 


In PPCA, we explicitly write down the probabilistic model for linear dimensionality
reduction. For this we assume a continuous latent variable
$z \in \mathbb{R}^M$ with a standard-normal prior $p(z) = \mathcal{N}(0, I)$ and a linear relationship
between the latent variables and the observed data $x$ where

$$x = B z + \mu + \epsilon \in \mathbb{R}^D , \tag{10.63}$$

where $\epsilon \sim \mathcal{N}(0, \sigma^2 I)$ is Gaussian observation noise and $B \in \mathbb{R}^{D \times M}$
and $\mu \in \mathbb{R}^D$ describe the linear/affine mapping from latent to observed
variables. Therefore, PPCA links latent and observed variables via

$$p(x \mid z, B, \mu, \sigma^2) = \mathcal{N}(x \mid B z + \mu, \sigma^2 I) . \tag{10.64}$$

Overall, PPCA induces the following generative process:

$$z_n \sim \mathcal{N}(z \mid 0, I) \tag{10.65}$$
$$x_n \mid z_n \sim \mathcal{N}(x \mid B z_n + \mu, \sigma^2 I) \tag{10.66}$$

To generate a data point that is typical given the model parameters, we
follow an ancestral sampling scheme: We first sample a latent variable $z_n$
from $p(z)$. Then we use $z_n$ in (10.64) to sample a data point conditioned
on the sampled $z_n$, i.e., $x_n \sim p(x \mid z_n, B, \mu, \sigma^2)$.

This generative process allows us to write down the probabilistic model
(i.e., the joint distribution of all random variables; see Section 8.4) as

$$p(x, z \mid B, \mu, \sigma^2) = p(x \mid z, B, \mu, \sigma^2) p(z) , \tag{10.67}$$

which immediately gives rise to the graphical model in Figure 10.14 using
the results from Section 8.5.
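
The ancestral sampling scheme (10.65) and (10.66) can be sketched as follows, assuming NumPy. The function name `sample_ppca` and the parameter values for $B$, $\mu$, and $\sigma^2$ are made up for illustration; samples are stored as rows.

```python
import numpy as np

def sample_ppca(B, mu, sigma2, num_samples, rng):
    """Draw (z_n, x_n) pairs from the PPCA generative process; samples are rows."""
    D, M = B.shape
    Z = rng.standard_normal((num_samples, M))                       # z_n ~ N(0, I), cf. (10.65)
    noise = np.sqrt(sigma2) * rng.standard_normal((num_samples, D)) # eps ~ N(0, sigma2 I)
    X = Z @ B.T + mu + noise                                        # x_n | z_n, cf. (10.66)
    return Z, X

# Hypothetical model parameters with D = 3 and M = 2.
rng = np.random.default_rng(7)
B = np.array([[1.0, 0.0], [0.5, 1.0], [0.0, -1.0]])
mu = np.array([1.0, -2.0, 0.5])
sigma2 = 0.1
Z, X = sample_ppca(B, mu, sigma2, 1000, rng)
print(Z.shape, X.shape)
```

Sampling the latent variable first and the observation second mirrors the direction of the arrow in the graphical model.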


Remark. Note the direction of the arrow that connects the latent variables
$z$ and the observed data $x$: The arrow points from $z$ to $x$, which means
that the PPCA model assumes a lower-dimensional latent cause $z$ for high-dimensional
observations $x$. In the end, we are obviously interested in
finding something out about $z$ given some observations. To get there we
will apply Bayesian inference to “invert” the arrow implicitly and go from
observations to latent variables. ♢


Example 10.5 (Generating New Data Using Latent Variables)

Figure 10.15 shows the latent coordinates of the MNIST digits “8” found
by PCA when using a two-dimensional principal subspace (blue dots).
We can query any vector $z_*$ in this latent space and generate an image
$\tilde{x}_* = B z_*$ that resembles the digit “8”. We show eight of such generated
images with their corresponding latent space representation. Depending
on where we query the latent space, the generated images look different
(shape, rotation, size, etc.). If we query away from the training data, we
see more and more artifacts, e.g., the top-left and top-right digits. Note that
the intrinsic dimensionality of these generated images is only two.

Figure 10.14: Graphical model for probabilistic PCA. The observations $x_n$
explicitly depend on corresponding latent variables $z_n \sim \mathcal{N}(0, I)$. The
model parameters $B, \mu$ and the likelihood parameter $\sigma$ are shared across the
dataset.

Figure 10.15: Generating new MNIST digits. The latent variables $z$ can be used to
generate new data $\tilde{x} = B z$. The closer we stay to the training data, the
more realistic the generated data.


10.7.2 Likelihood and Joint Distribution


Using the results from Chapter 6, we obtain the likelihood of this probabilistic
model by integrating out the latent variable $z$ (see Section 8.4.3)
so that

$$p(x \mid B, \mu, \sigma^2) = \int p(x \mid z, B, \mu, \sigma^2) p(z) \, \mathrm{d}z \tag{10.68a}$$
$$= \int \mathcal{N}(x \mid B z + \mu, \sigma^2 I) \, \mathcal{N}(z \mid 0, I) \, \mathrm{d}z . \tag{10.68b}$$

From Section 6.5, we know that the solution to this integral is a Gaussian
distribution with mean

$$\mathbb{E}_x[x] = \mathbb{E}_z[B z + \mu] + \mathbb{E}_\epsilon[\epsilon] = \mu \tag{10.69}$$

and with covariance matrix

$$\mathbb{V}[x] = \mathbb{V}_z[B z + \mu] + \mathbb{V}_\epsilon[\epsilon] = \mathbb{V}_z[B z] + \sigma^2 I \tag{10.70a}$$
$$= B \mathbb{V}_z[z] B^\top + \sigma^2 I = B B^\top + \sigma^2 I . \tag{10.70b}$$

The likelihood in (10.68b) can be used for maximum likelihood or MAP
estimation of the model parameters.
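
As a rough check of (10.69) and (10.70b), the following sketch (assuming NumPy; the parameter values and the helper name `ppca_log_likelihood` are illustrative) draws samples from the generative process, compares their empirical mean and covariance with $\mu$ and $B B^\top + \sigma^2 I$, and evaluates the Gaussian log-likelihood of the marginal, which depends only on the data and the model parameters.

```python
import numpy as np

def ppca_log_likelihood(X, B, mu, sigma2):
    """Sum over rows x_n of log N(x_n | mu, B B^T + sigma2 I)."""
    N, D = X.shape
    cov = B @ B.T + sigma2 * np.eye(D)            # marginal covariance, cf. (10.70b)
    diff = X - mu
    _, logdet = np.linalg.slogdet(cov)
    # Squared Mahalanobis distance of every row, without forming an explicit inverse.
    maha = np.sum(diff * np.linalg.solve(cov, diff.T).T, axis=1)
    return -0.5 * np.sum(maha + logdet + D * np.log(2.0 * np.pi))

# Hypothetical parameters (D = 3, M = 2) and samples from the generative process.
rng = np.random.default_rng(9)
B = np.array([[1.0, 0.0], [0.5, 1.0], [0.0, -1.0]])
mu = np.array([1.0, -2.0, 0.5])
sigma2 = 0.1
num = 100_000
Z = rng.standard_normal((num, 2))
X = Z @ B.T + mu + np.sqrt(sigma2) * rng.standard_normal((num, 3))

emp_mean = X.mean(axis=0)
emp_cov = np.cov(X, rowvar=False)
ll_true = ppca_log_likelihood(X, B, mu, sigma2)
ll_shifted = ppca_log_likelihood(X, B, mu + 1.0, sigma2)   # deliberately wrong mean
print(emp_mean, ll_true > ll_shifted)
```

The empirical moments only approximate the analytic ones; the agreement improves as the number of samples grows.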


Remark. We cannot use the conditional distribution in (10.64) for maximum
likelihood estimation as it still depends on the latent variables. The
likelihood function we require for maximum likelihood (or MAP) estimation
should only be a function of the data $x$ and the model parameters,
but must not depend on the latent variables. ♢


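From Section 6.5, we know that a Gaussian random variable $z$ and
a linear/affine transformation $x = B z$ of it are jointly Gaussian distributed.
We already know the marginals $p(z) = \mathcal{N}(z \mid 0, I)$ and
$p(x) = \mathcal{N}(x \mid \mu, B B^\top + \sigma^2 I)$. The missing cross-covariance is given as

$$\operatorname{Cov}[x, z] = \operatorname{Cov}_z[B z + \mu] = B \operatorname{Cov}_z[z, z] = B . \tag{10.71}$$

Therefore, the probabilistic model of PPCA, i.e., the joint distribution of
latent and observed random variables is explicitly given by

$$p(x, z \mid B, \mu, \sigma^2) = \mathcal{N}\left( \begin{bmatrix} x \\ z \end{bmatrix} \,\middle|\, \begin{bmatrix} \mu \\ 0 \end{bmatrix}, \begin{bmatrix} B B^\top + \sigma^2 I & B \\ B^\top & I \end{bmatrix} \right) , \tag{10.72}$$

with a mean vector of length $D + M$ and a covariance matrix of size
$(D + M) \times (D + M)$.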


10.7.3 Posterior Distribution 


The joint Gaussian distribution $p(x, z | B, \mu, \sigma^2)$ in (10.72) allows us to 
determine the posterior distribution $p(z | x)$ immediately by applying the 
rules of Gaussian conditioning from Section 6.5.1. The posterior distribution of the latent variable given an observation $x$ is then 

$$p(z | x) = \mathcal{N}(z | m, C), \tag{10.73}$$
$$m = B^\top (BB^\top + \sigma^2 I)^{-1} (x - \mu), \tag{10.74}$$
$$C = I - B^\top (BB^\top + \sigma^2 I)^{-1} B. \tag{10.75}$$


Note that the posterior covariance does not depend on the observed data 
$x$. For a new observation $x_*$ in data space, we use (10.73) to determine 
the posterior distribution of the corresponding latent variable $z_*$. The covariance matrix $C$ allows us to assess how confident the embedding is. A 
covariance matrix $C$ with a small determinant (which measures volumes) 
tells us that the latent embedding $z_*$ is fairly certain. If we obtain a posterior distribution $p(z_* | x_*)$ with much variance, we may be faced with 
an outlier. However, we can explore this posterior distribution to understand what other data points $x$ are plausible under this posterior. To do 
this, we exploit the generative process underlying PPCA, which allows us 
to explore the posterior distribution on the latent variables by generating 
new data that is plausible under this posterior: 


1. Sample a latent variable $z_* \sim p(z | x_*)$ from the posterior distribution 
over the latent variables (10.73). 
2. Sample a reconstructed vector $\tilde{x}_* \sim p(x | z_*, B, \mu, \sigma^2)$ from (10.64). 


If we repeat this process many times, we can explore the posterior distribution (10.73) on the latent variables $z_*$ and its implications on the 
observed data. The sampling process effectively hypothesizes data that 
is plausible under the posterior distribution. 
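The posterior in (10.73) and the two-step sampling procedure can be sketched in a few lines of NumPy. This is an illustration under assumed values: the parameters `B`, `mu`, `sigma2`, the observation `x_star`, and the helper name `ppca_posterior` are all hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)
D, M = 5, 2                          # illustrative sizes (assumed)
B = rng.normal(size=(D, M))          # hypothetical PPCA parameters
mu = rng.normal(size=D)
sigma2 = 0.1

def ppca_posterior(x, B, mu, sigma2):
    """Return mean m and covariance C of p(z | x), cf. (10.74) and (10.75)."""
    D, M = B.shape
    S = B @ B.T + sigma2 * np.eye(D)     # marginal covariance of x
    G = np.linalg.solve(S, B).T          # equals B^T S^{-1} because S is symmetric
    m = G @ (x - mu)
    C = np.eye(M) - G @ B
    return m, 0.5 * (C + C.T)            # symmetrize against round-off

x_star = mu + B @ np.array([1.0, -0.5])  # a made-up observation
m, C = ppca_posterior(x_star, B, mu, sigma2)

# Step 1: sample z_* from the posterior p(z | x_*).
z_samples = rng.multivariate_normal(m, C, size=1000)
# Step 2: sample reconstructions from p(x | z_*, B, mu, sigma2).
noise = np.sqrt(sigma2) * rng.normal(size=(1000, D))
x_samples = z_samples @ B.T + mu + noise
```

The rows of `x_samples` are hypothesized observations that are plausible under the posterior. Their spread around `x_star` gives a rough impression of how certain the embedding is.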


10.8 Further Reading 


We derived PCA from two perspectives: (a) maximizing the variance in the 
projected space; (b) minimizing the average reconstruction error. How- 
ever, PCA can also be interpreted from different perspectives. Let us recap 
what we have done: We took high-dimensional data $x \in \mathbb{R}^D$ and used 
a matrix $B^\top$ to find a lower-dimensional representation $z \in \mathbb{R}^M$. The 
columns of $B$ are the eigenvectors of the data covariance matrix $S$ that are 
associated with the largest eigenvalues. Once we have a low-dimensional 
representation $z$, we can get a high-dimensional version of it (in the original data space) as $x \approx \tilde{x} = Bz = BB^\top x \in \mathbb{R}^D$, where $BB^\top$ is a 
projection matrix. 

Figure 10.16 PCA can be viewed as a linear auto-encoder. It encodes the high-dimensional data $x$ into a lower-dimensional representation (code) $z \in \mathbb{R}^M$ and decodes $z$ using a decoder. The decoded vector $\tilde{x}$ is the orthogonal projection of the original data $x$ onto the $M$-dimensional principal subspace. 

The code is a compressed version of the original data. 

We can also think of PCA as a linear auto-encoder as illustrated in Figure 10.16. An auto-encoder encodes the data $x_n \in \mathbb{R}^D$ to a code $z_n \in \mathbb{R}^M$ 
and decodes it to a $\tilde{x}_n$ similar to $x_n$. The mapping from the data to the 
code is called the encoder, and the mapping from the code back to the original data space is called the decoder. If we consider linear mappings where 
the code is given by $z_n = B^\top x_n \in \mathbb{R}^M$ and we are interested in minimizing the average squared error between the data $x_n$ and its reconstruction 
$\tilde{x}_n = Bz_n$, $n = 1, \dots, N$, we obtain 

$$\frac{1}{N} \sum_{n=1}^{N} \|x_n - \tilde{x}_n\|^2 = \frac{1}{N} \sum_{n=1}^{N} \|x_n - BB^\top x_n\|^2. \tag{10.76}$$








This means we end up with the same objective function as in (10.29) that 
we discussed in Section 10.3 so that we obtain the PCA solution when we 
minimize the squared auto-encoding loss. If we replace the linear map- 
ping of PCA with a nonlinear mapping, we get a nonlinear auto-encoder. 
A prominent example of this is a deep auto-encoder where the linear func- 
tions are replaced with deep neural networks. In this context, the encoder 
is also known as a recognition network or inference network, whereas the 
decoder is also called a generator. 
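The linear auto-encoder view can be illustrated numerically. The sketch below uses synthetic data (an assumption, as are all variable names): it encodes with $B^\top$, decodes with $B$, and compares the average squared reconstruction error with the sum of the discarded eigenvalues of the data covariance matrix.

```python
import numpy as np

rng = np.random.default_rng(0)
N, D, M = 500, 5, 2                          # illustrative sizes (assumed)
X = rng.normal(size=(N, D)) @ rng.normal(size=(D, D))   # synthetic data
X = X - X.mean(axis=0)                       # center the data

S = X.T @ X / N                              # data covariance matrix
eigvals, eigvecs = np.linalg.eigh(S)         # eigenvalues in ascending order
order = np.argsort(eigvals)[::-1]            # sort descending
eigvals, eigvecs = eigvals[order], eigvecs[:, order]
B = eigvecs[:, :M]                           # top-M eigenvectors as columns

Z = X @ B                                    # encoder: z_n = B^T x_n
X_tilde = Z @ B.T                            # decoder: x_tilde_n = B z_n
loss = np.mean(np.sum((X - X_tilde) ** 2, axis=1))

# Any other orthonormal basis of an M-dimensional subspace should not do better.
Q, _ = np.linalg.qr(rng.normal(size=(D, M)))
loss_random = np.mean(np.sum((X - X @ Q @ Q.T) ** 2, axis=1))

print(loss, eigvals[M:].sum(), loss_random)
```

For centered data and the covariance normalized by $N$, the auto-encoding loss of the PCA solution coincides with the sum of the $D - M$ smallest eigenvalues, and a randomly chosen subspace yields a loss that is at least as large.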

Another interpretation of PCA is related to information theory. We can 
think of the code as a smaller or compressed version of the original data 
point. When we reconstruct our original data using the code, we do not 
get the exact data point back, but a slightly distorted or noisy version 
of it. This means that our compression is “lossy”. Intuitively, we want 
to maximize the correlation between the original data and the lower- 
dimensional code. More formally, this is related to the mutual information. 
We would then get the same solution to PCA we discussed in Section 10.3 
by maximizing the mutual information, a core concept in information the- 
ory (MacKay, 2003). 

In our discussion on PPCA, we assumed that the parameters of the 
model, i.e., $B$, $\mu$, and the likelihood parameter $\sigma^2$, are known. Tipping 
and Bishop (1999) describe how to derive maximum likelihood estimates 
for these parameters in the PPCA setting (note that we use a different 
notation in this chapter). The maximum likelihood parameters, when projecting $D$-dimensional data onto an $M$-dimensional subspace, are 

$$\mu_{\mathrm{ML}} = \frac{1}{N} \sum_{n=1}^{N} x_n, \tag{10.77}$$
$$B_{\mathrm{ML}} = T (\Lambda - \sigma^2 I)^{\frac{1}{2}} R, \tag{10.78}$$
$$\sigma^2_{\mathrm{ML}} = \frac{1}{D - M} \sum_{j=M+1}^{D} \lambda_j, \tag{10.79}$$

where $T \in \mathbb{R}^{D \times M}$ contains $M$ eigenvectors of the data covariance matrix, 
$\Lambda = \mathrm{diag}(\lambda_1, \dots, \lambda_M) \in \mathbb{R}^{M \times M}$ is a diagonal matrix with the eigenvalues 
associated with the principal axes on its diagonal, and $R \in \mathbb{R}^{M \times M}$ is 
an arbitrary orthogonal matrix. The maximum likelihood solution $B_{\mathrm{ML}}$ is 
unique up to an arbitrary orthogonal transformation, e.g., we can right-multiply $B_{\mathrm{ML}}$ with any rotation matrix $R$ so that (10.78) essentially is a 
singular value decomposition (see Section 4.5). An outline of the proof is 
given by Tipping and Bishop (1999). 

The matrix $\Lambda - \sigma^2 I$ in (10.78) is guaranteed to be positive semidefinite as the smallest eigenvalue of the data covariance matrix is bounded from below by the noise variance $\sigma^2$. 

The maximum likelihood estimate for $\mu$ given in (10.77) is the sample 
mean of the data. The maximum likelihood estimator for the observation 
noise variance $\sigma^2$ given in (10.79) is the average variance in the orthogonal complement of the principal subspace, i.e., the average leftover variance that we cannot capture with the first $M$ principal components is 
treated as observation noise. 
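The closed-form estimates (10.77) through (10.79) translate almost directly into code. The following is a minimal sketch, not a reference implementation: the function name `ppca_ml`, the synthetic data, and the choice of normalizing the covariance by $N$ are assumptions made for illustration.

```python
import numpy as np

def ppca_ml(X, M, R=None):
    """Maximum likelihood PPCA parameters for data X (N x D), with M < D."""
    N, D = X.shape
    mu = X.mean(axis=0)                          # (10.77): sample mean
    Xc = X - mu
    S = Xc.T @ Xc / N                            # data covariance matrix
    lam, T = np.linalg.eigh(S)                   # ascending eigenvalues
    lam, T = lam[::-1], T[:, ::-1]               # reorder to descending
    sigma2 = lam[M:].mean()                      # (10.79): mean discarded eigenvalue
    if R is None:
        R = np.eye(M)                            # any orthogonal R is admissible
    scale = np.sqrt(np.maximum(lam[:M] - sigma2, 0.0))
    B = (T[:, :M] * scale) @ R                   # (10.78)
    return mu, B, sigma2

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 6)) @ rng.normal(size=(6, 6))   # synthetic data
mu_ml, B_ml, sigma2_ml = ppca_ml(X, M=2)

# A rotated solution spans the same model: B R (B R)^T = B B^T.
theta = 0.7
R = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])
_, B_rot, _ = ppca_ml(X, M=2, R=R)
print(np.allclose(B_rot @ B_rot.T, B_ml @ B_ml.T))
```

The final check illustrates the non-uniqueness discussed above: right-multiplying by an orthogonal matrix changes $B_{\mathrm{ML}}$ but leaves the model covariance $BB^\top + \sigma^2 I$ untouched.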

In the noise-free limit where $\sigma \to 0$, PPCA and PCA provide identical 
solutions: Since the data covariance matrix $S$ is symmetric, it can be diagonalized (see Section 4.4), i.e., there exists a matrix $T$ of eigenvectors 
of $S$ so that 

$$S = T \Lambda T^{-1}. \tag{10.80}$$

In the PPCA model, the data covariance matrix is the covariance matrix of 
the Gaussian likelihood $p(x | B, \mu, \sigma^2)$, which is $BB^\top + \sigma^2 I$, see (10.70b). 
For $\sigma \to 0$, we obtain $BB^\top$ so that this data covariance must equal the 
PCA data covariance (and its factorization given in (10.80)) so that 

$$\mathrm{Cov}[\mathcal{X}] = T \Lambda T^{-1} = BB^\top \iff B = T \Lambda^{\frac{1}{2}} R, \tag{10.81}$$

i.e., we obtain the maximum likelihood estimate in (10.78) for $\sigma = 0$. 
From (10.78) and (10.80), it becomes clear that (P)PCA performs a de- 
composition of the data covariance matrix. 

In a streaming setting, where data arrives sequentially, it is recom- 
mended to use the iterative expectation maximization (EM) algorithm for 
maximum likelihood estimation (Roweis, 1998). 

To determine the dimensionality of the latent variables (the length of 
the code, the dimensionality of the lower-dimensional subspace onto which 
we project the data), Gavish and Donoho (2014) suggest the heuristic 
that, if we can estimate the noise variance $\sigma^2$ of the data, we should 
discard all singular values smaller than $\frac{4\sigma\sqrt{D}}{\sqrt{3}}$. Alternatively, we can use 
(nested) cross-validation (Section 8.6.1) or Bayesian model selection criteria (discussed in Section 8.6.2) to determine a good estimate of the 
intrinsic dimensionality of the data (Minka, 2001b). 


Similar to our discussion on linear regression in Chapter 9, we can place 
a prior distribution on the parameters of the model and integrate them 
out. By doing so, we (a) avoid point estimates of the parameters and the 
issues that come with these point estimates (see Section 8.6) and (b) al- 
low for an automatic selection of the appropriate dimensionality M of the 
latent space. In this Bayesian PCA, which was proposed by Bishop (1999), 
a prior $p(\mu, B, \sigma^2)$ is placed on the model parameters. The generative 
process allows us to integrate the model parameters out instead of condi- 
tioning on them, which addresses overfitting issues. Since this integration 
is analytically intractable, Bishop (1999) proposes to use approximate in- 
ference methods, such as MCMC or variational inference. We refer to the 
work by Gilks et al. (1996) and Blei et al. (2017) for more details on these 
approximate inference techniques. 


In PPCA, we considered the linear model $p(x_n | z_n) = \mathcal{N}(x_n | Bz_n + \mu, \sigma^2 I)$ with prior $p(z_n) = \mathcal{N}(0, I)$, where all observation dimensions 
are affected by the same amount of noise. If we allow each observation 
dimension $d$ to have a different variance $\sigma_d^2$, we obtain factor analysis 
(FA) (Spearman, 1904; Bartholomew et al., 2011). This means that FA 
gives the likelihood some more flexibility than PPCA, but still forces the 
data to be explained by the model parameters $B$, $\mu$. However, FA no 
longer allows for a closed-form maximum likelihood solution so that we 
need to use an iterative scheme, such as the expectation maximization 
algorithm, to estimate the model parameters. While in PPCA all station- 
ary points are global optima, this no longer holds for FA. Compared to 
PPCA, FA does not change if we scale the data, but it does return different 
solutions if we rotate the data. 


An algorithm that is also closely related to PCA is independent component analysis (ICA; Hyvärinen et al., 2001). Starting again with the 
latent-variable perspective $p(x_n | z_n) = \mathcal{N}(x_n | Bz_n + \mu, \sigma^2 I)$, we now 
change the prior on $z_n$ to non-Gaussian distributions. ICA can be used 
for blind-source separation. Imagine you are in a busy train station with 
many people talking. Your ears play the role of microphones, and they 
linearly mix different speech signals in the train station. The goal of blind- 
source separation is to identify the constituent parts of the mixed signals. 
As discussed previously in the context of maximum likelihood estimation 
for PPCA, the original PCA solution is invariant to any rotation. Therefore, 
PCA can identify the best lower-dimensional subspace in which the sig- 
nals live, but not the signals themselves (Murphy, 2012). ICA addresses 
this issue by modifying the prior distribution p(z) on the latent sources 
to require non-Gaussian priors $p(z)$. We refer to the books by Hyvärinen 
et al. (2001) and Murphy (2012) for more details on ICA. 

PCA, factor analysis, and ICA are three examples for dimensionality re- 
duction with linear models. Cunningham and Ghahramani (2015) provide 
a broader survey of linear dimensionality reduction. 

The (P)PCA model we discussed here allows for several important ex- 
tensions. In Section 10.5, we explained how to do PCA when the in- 
put dimensionality D is significantly greater than the number N of data 
points. By exploiting the insight that PCA can be performed by computing 
(many) inner products, this idea can be pushed to the extreme by consid- 
ering infinite-dimensional features. The kernel trick is the basis of kernel 
PCA and allows us to implicitly compute inner products between infinite- 
dimensional features (Schölkopf et al., 1998; Schölkopf and Smola, 2002). 

There are nonlinear dimensionality reduction techniques that are de- 
rived from PCA (Burges (2010) provides a good overview). The auto- 
encoder perspective of PCA that we discussed previously in this section 
can be used to render PCA as a special case of a deep auto-encoder. In the 
deep auto-encoder, both the encoder and the decoder are represented by 
multilayer feedforward neural networks, which themselves are nonlinear 
mappings. If we set the activation functions in these neural networks to be 
the identity, the model becomes equivalent to PCA. A different approach to 
nonlinear dimensionality reduction is the Gaussian process latent-variable 
model (GP-LVM) proposed by Lawrence (2005). The GP-LVM starts off with 
the latent-variable perspective that we used to derive PPCA and replaces 
the linear relationship between the latent variables z and the observations 
$x$ with a Gaussian process (GP). Instead of estimating the parameters of 
the mapping (as we do in PPCA), the GP-LVM marginalizes out the model 
parameters and makes point estimates of the latent variables z. Similar 
to Bayesian PCA, the Bayesian GP-LVM proposed by Titsias and Lawrence 
(2010) maintains a distribution on the latent variables z and uses approx- 
imate inference to integrate them out as well. 
Density Estimation with Gaussian Mixture Models 


In earlier chapters, we already covered two fundamental problems in 
machine learning: regression (Chapter 9) and dimensionality reduction 
(Chapter 10). In this chapter, we will have a look at a third pillar of ma- 
chine learning: density estimation. On our journey, we introduce impor- 
tant concepts, such as the expectation maximization (EM) algorithm and 
a latent variable perspective of density estimation with mixture models. 

When we apply machine learning to data we often aim to represent 
data in some way. A straightforward way is to take the data points them- 
selves as the representation of the data; see Figure 11.1 for an example. 
However, this approach may be unhelpful if the dataset is huge or if we 
are interested in representing characteristics of the data. In density esti- 
mation, we represent the data compactly using a density from a paramet- 
ric family, e.g., a Gaussian or Beta distribution. For example, we may be 
looking for the mean and variance of a dataset in order to represent the 
data compactly using a Gaussian distribution. The mean and variance can 
be found using tools we discussed in Section 8.3: maximum likelihood or 
maximum a posteriori estimation. We can then use the mean and variance 
of this Gaussian to represent the distribution underlying the data, i.e., we 
think of the dataset to be a typical realization from this distribution if we 
were to sample from it. 
In practice, the Gaussian (or similarly all other distributions we encountered so far) has limited modeling capabilities. For example, a Gaussian 
approximation of the density that generated the data in Figure 11.1 would 
be a poor approximation. In the following, we will look at a more ex- 
pressive family of distributions, which we can use for density estimation: 
mixture models. 

Mixture models can be used to describe a distribution $p(x)$ by a convex 
combination of $K$ simple (base) distributions 

$$p(x) = \sum_{k=1}^{K} \pi_k p_k(x) \tag{11.1}$$
$$0 \leq \pi_k \leq 1, \quad \sum_{k=1}^{K} \pi_k = 1, \tag{11.2}$$

where the components $p_k$ are members of a family of basic distributions, 
e.g., Gaussians, Bernoullis, or Gammas, and the $\pi_k$ are mixture weights. 
Mixture models are more expressive than the corresponding base distri- 
butions because they allow for multimodal data representations, i.e., they 
can describe datasets with multiple “clusters”, such as the example in Fig- 
ure 11.1. 

We will focus on Gaussian mixture models (GMMs), where the basic 
distributions are Gaussians. For a given dataset, we aim to maximize the 
likelihood of the model parameters to train the GMM. For this purpose, 
we will use results from Chapter 5, Chapter 6, and Section 7.2. However, 
unlike other applications we discussed earlier (linear regression or PCA), 
we will not find a closed-form maximum likelihood solution. Instead, we 
will arrive at a set of dependent simultaneous equations, which we can 
only solve iteratively. 


11.1 Gaussian Mixture Model 


A Gaussian mixture model is a density model where we combine a finite 
number of $K$ Gaussian distributions $\mathcal{N}(x | \mu_k, \Sigma_k)$ so that 

$$p(x | \theta) = \sum_{k=1}^{K} \pi_k \mathcal{N}(x | \mu_k, \Sigma_k) \tag{11.3}$$
$$0 \leq \pi_k \leq 1, \quad \sum_{k=1}^{K} \pi_k = 1, \tag{11.4}$$

where we defined $\theta := \{\mu_k, \Sigma_k, \pi_k : k = 1, \dots, K\}$ as the collection of 
all parameters of the model. This convex combination of Gaussian distributions gives us significantly more flexibility for modeling complex densities than a simple Gaussian distribution (which we recover from (11.3) for 
$K = 1$). An illustration is given in Figure 11.2, displaying the weighted 
components and the mixture density, which is given as 

$$p(x | \theta) = 0.5\, \mathcal{N}(x | -2, \tfrac{1}{2}) + 0.2\, \mathcal{N}(x | 1, 2) + 0.3\, \mathcal{N}(x | 4, 1). \tag{11.5}$$
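To make the convex-combination structure concrete, the sketch below evaluates a one-dimensional GMM density on a grid and checks numerically that it integrates to one. The weights, means, and variances follow the example mixture above, with the variance of the first component read as 1/2 (an assumption, since that entry is hard to read in the source); the helper name `gmm_pdf` is hypothetical.

```python
import numpy as np

def gmm_pdf(x, weights, means, variances):
    """Density of a 1-D Gaussian mixture evaluated at the points x."""
    x = np.asarray(x, dtype=float)[..., None]            # shape (..., 1)
    comp = np.exp(-0.5 * (x - means) ** 2 / variances)   # shape (..., K)
    comp = comp / np.sqrt(2.0 * np.pi * variances)
    return comp @ weights                                # convex combination

weights = np.array([0.5, 0.2, 0.3])
means = np.array([-2.0, 1.0, 4.0])
variances = np.array([0.5, 2.0, 1.0])     # first entry assumed to be 1/2

grid = np.linspace(-30.0, 30.0, 60001)
dx = grid[1] - grid[0]
density = gmm_pdf(grid, weights, means, variances)
area = density.sum() * dx                 # simple Riemann sum

print("integral of the mixture density:", area)
```

Because each component integrates to one and the weights sum to one, the mixture is itself a valid density, and the numerical integral should be very close to 1.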


11.2 Parameter Learning via Maximum Likelihood 


Assume we are given a dataset $\mathcal{X} = \{x_1, \dots, x_N\}$, where $x_n$, $n = 1, \dots, N$, are drawn i.i.d. from an unknown distribution $p(x)$. Our objective is to find a good approximation/representation of this unknown 
distribution $p(x)$ by means of a GMM with $K$ mixture components. The 
parameters of the GMM are the $K$ means $\mu_k$, the covariances $\Sigma_k$, and 
mixture weights $\pi_k$. We summarize all these free parameters in $\theta := \{\pi_k, \mu_k, \Sigma_k : k = 1, \dots, K\}$. 


Example 11.1 (Initial Setting) 





[Figure: GMM density and the individual weighted components $\pi_k \mathcal{N}(x | \mu_k, \sigma_k^2)$, $k = 1, 2, 3$.] 














Throughout this chapter, we will have a simple running example that 
helps us illustrate and visualize important concepts. 
We consider a one-dimensional dataset $\mathcal{X} = \{-3, -2.5, -1, 0, 2, 4, 5\}$ 
consisting of seven data points and wish to find a GMM with $K = 3$ 
components that models the density of the data. We initialize the mixture 
components as 

$$p_1(x) = \mathcal{N}(x | -4, 1) \tag{11.6}$$
$$p_2(x) = \mathcal{N}(x | 0, 0.2) \tag{11.7}$$
$$p_3(x) = \mathcal{N}(x | 8, 3) \tag{11.8}$$

and assign them equal weights $\pi_1 = \pi_2 = \pi_3 = \frac{1}{3}$. The corresponding 
model (and the data points) are shown in Figure 11.3. 


In the following, we detail how to obtain a maximum likelihood estimate $\theta_{\mathrm{ML}}$ of the model parameters $\theta$. We start by writing down the likelihood, i.e., the predictive distribution of the training data given the parameters. We exploit our i.i.d. assumption, which leads to the factorized 
likelihood 

$$p(\mathcal{X} | \theta) = \prod_{n=1}^{N} p(x_n | \theta), \quad p(x_n | \theta) = \sum_{k=1}^{K} \pi_k \mathcal{N}(x_n | \mu_k, \Sigma_k), \tag{11.9}$$

where every individual likelihood term $p(x_n | \theta)$ is a Gaussian mixture 
density. Then we obtain the log-likelihood as 

$$\log p(\mathcal{X} | \theta) = \sum_{n=1}^{N} \log p(x_n | \theta) = \underbrace{\sum_{n=1}^{N} \log \sum_{k=1}^{K} \pi_k \mathcal{N}(x_n | \mu_k, \Sigma_k)}_{=: \mathcal{L}}. \tag{11.10}$$

We aim to find parameters $\theta^*_{\mathrm{ML}}$ that maximize the log-likelihood $\mathcal{L}$ defined 
in (11.10). Our "normal" procedure would be to compute the gradient 
$\mathrm{d}\mathcal{L}/\mathrm{d}\theta$ of the log-likelihood with respect to the model parameters $\theta$, set 
it to $0$, and solve for $\theta$. However, unlike our previous examples for maximum likelihood estimation (e.g., when we discussed linear regression in 
Section 9.2), we cannot obtain a closed-form solution. Instead, we can 
exploit an iterative scheme to find good model parameters $\theta_{\mathrm{ML}}$, which will 
turn out to be the EM algorithm for GMMs. The key idea is to update one 
model parameter at a time while keeping the others fixed. 
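As a small numerical companion, the log-likelihood (11.10) of the running example can be evaluated as follows. This is a sketch under the assumption that the second argument of each initial component in Example 11.1 is a variance; the function name `gmm_log_likelihood` is made up, and the inner sum is computed with the log-sum-exp trick for numerical stability.

```python
import numpy as np

X = np.array([-3.0, -2.5, -1.0, 0.0, 2.0, 4.0, 5.0])   # dataset of Example 11.1
weights = np.full(3, 1.0 / 3.0)                         # equal initial weights
means = np.array([-4.0, 0.0, 8.0])                      # initial means
variances = np.array([1.0, 0.2, 3.0])                   # assumed to be variances

def gmm_log_likelihood(X, weights, means, variances):
    """Log-likelihood (11.10) of a 1-D GMM, computed via log-sum-exp."""
    diff = X[:, None] - means[None, :]                  # shape (N, K)
    log_comp = (np.log(weights)
                - 0.5 * np.log(2.0 * np.pi * variances)
                - 0.5 * diff ** 2 / variances)          # log(pi_k N(x_n | ...))
    a = log_comp.max(axis=1, keepdims=True)             # stabilizing shift
    log_px = a[:, 0] + np.log(np.exp(log_comp - a).sum(axis=1))
    return float(log_px.sum())

ll = gmm_log_likelihood(X, weights, means, variances)
print("log-likelihood at initialization:", ll)
```

Note that the logarithm sits outside the sum over components, which is exactly why the code has to combine all $K$ components for every data point before taking the log.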


Remark. If we were to consider a single Gaussian as the desired density, 
the sum over $k$ in (11.10) vanishes, and the log can be applied directly to 
the Gaussian component, such that we get 

$$\log \mathcal{N}(x | \mu, \Sigma) = -\frac{D}{2} \log(2\pi) - \frac{1}{2} \log \det(\Sigma) - \frac{1}{2} (x - \mu)^\top \Sigma^{-1} (x - \mu). \tag{11.11}$$

This simple form allows us to find closed-form maximum likelihood estimates of $\mu$ and $\Sigma$, as discussed in Chapter 8. In (11.10), we cannot move 
the log into the sum over $k$ so that we cannot obtain a simple closed-form 
maximum likelihood solution. ♢ 


©2021 M. P. Deisenroth, A. A. Faisal, C. S. Ong. Published by Cambridge University Press (2020). 


$r_n$ follows a Boltzmann/Gibbs distribution. 


Any local optimum of a function exhibits the property that its gradi- 
ent with respect to the parameters must vanish (necessary condition); see 
Chapter 7. In our case, we obtain the following necessary conditions when 
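we optimize the log-likelihood in (11.10) with respect to the GMM parameters $\mu_k, \Sigma_k, \pi_k$: 

$$\frac{\partial \mathcal{L}}{\partial \mu_k} = 0^\top \iff \sum_{n=1}^{N} \frac{\partial \log p(x_n | \theta)}{\partial \mu_k} = 0^\top, \tag{11.12}$$
$$\frac{\partial \mathcal{L}}{\partial \Sigma_k} = 0 \iff \sum_{n=1}^{N} \frac{\partial \log p(x_n | \theta)}{\partial \Sigma_k} = 0, \tag{11.13}$$
$$\frac{\partial \mathcal{L}}{\partial \pi_k} = 0 \iff \sum_{n=1}^{N} \frac{\partial \log p(x_n | \theta)}{\partial \pi_k} = 0. \tag{11.14}$$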

For all three necessary conditions, by applying the chain rule (see Sec- 
tion 5.2.2), we require partial derivatives of the form 


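$$\frac{\partial \log p(x_n | \theta)}{\partial \theta} = \frac{1}{p(x_n | \theta)} \frac{\partial p(x_n | \theta)}{\partial \theta}, \tag{11.15}$$

where $\theta = \{\mu_k, \Sigma_k, \pi_k, k = 1, \dots, K\}$ are the model parameters and 

$$\frac{1}{p(x_n | \theta)} = \frac{1}{\sum_{j=1}^{K} \pi_j \mathcal{N}(x_n | \mu_j, \Sigma_j)}. \tag{11.16}$$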


In the following, we will compute the partial derivatives (11.12) through 
(11.14). But before we do this, we introduce a quantity that will play a 
central role in the remainder of this chapter: responsibilities. 


11.2.1 Responsibilities 
We define the quantity 

$$r_{nk} := \frac{\pi_k \mathcal{N}(x_n | \mu_k, \Sigma_k)}{\sum_{j=1}^{K} \pi_j \mathcal{N}(x_n | \mu_j, \Sigma_j)} \tag{11.17}$$

as the responsibility of the $k$th mixture component for the $n$th data point. 
The responsibility $r_{nk}$ of the $k$th mixture component for data point $x_n$ is 
proportional to the likelihood 

$$p(x_n | \pi_k, \mu_k, \Sigma_k) = \pi_k \mathcal{N}(x_n | \mu_k, \Sigma_k) \tag{11.18}$$

of the mixture component given the data point. Therefore, mixture components have a high responsibility for a data point when the data point 
could be a plausible sample from that mixture component. Note that 
$r_n := [r_{n1}, \dots, r_{nK}]^\top \in \mathbb{R}^K$ is a (normalized) probability vector, i.e., 
$\sum_k r_{nk} = 1$ with $r_{nk} \geq 0$. This probability vector distributes probability mass among the $K$ mixture components, and we can think of $r_n$ as a 
"soft assignment" of $x_n$ to the $K$ mixture components. Therefore, the responsibility $r_{nk}$ from (11.17) represents the probability that $x_n$ has been 
generated by the $k$th mixture component. 


Example 11.2 (Responsibilities) 
For our example from Figure 11.3, we compute the responsibilities $r_{nk}$ 

$$\begin{bmatrix} 1.0 & 0.0 & 0.0 \\ 1.0 & 0.0 & 0.0 \\ 0.057 & 0.943 & 0.0 \\ 0.001 & 0.999 & 0.0 \\ 0.0 & 0.066 & 0.934 \\ 0.0 & 0.0 & 1.0 \\ 0.0 & 0.0 & 1.0 \end{bmatrix} \in \mathbb{R}^{N \times K}. \tag{11.19}$$

Here the $n$th row tells us the responsibilities of all mixture components 
for $x_n$. The sum of all $K$ responsibilities for a data point (sum of every 
row) is 1. The $k$th column gives us an overview of the responsibility of 
the $k$th mixture component. We can see that the third mixture component 
(third column) is not responsible for any of the first four data points, but 
takes much responsibility of the remaining data points. The sum of all 
entries of a column gives us the values $N_k$, i.e., the total responsibility of 
the $k$th mixture component. In our example, we get $N_1 = 2.058$, $N_2 = 2.008$, $N_3 = 2.934$. 
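The responsibilities of Example 11.2 can be recomputed with a few lines of NumPy. This is a sketch that assumes the second argument of each initial component in Example 11.1 is a variance; the function name `responsibilities` is hypothetical. Results may differ from the printed table in the last digit because of rounding.

```python
import numpy as np

X = np.array([-3.0, -2.5, -1.0, 0.0, 2.0, 4.0, 5.0])   # dataset of Example 11.1
weights = np.full(3, 1.0 / 3.0)
means = np.array([-4.0, 0.0, 8.0])
variances = np.array([1.0, 0.2, 3.0])                   # assumed to be variances

def responsibilities(X, weights, means, variances):
    """Responsibility matrix with entries r_nk as defined in (11.17)."""
    diff = X[:, None] - means[None, :]                  # shape (N, K)
    dens = np.exp(-0.5 * diff ** 2 / variances)
    dens = dens / np.sqrt(2.0 * np.pi * variances)      # N(x_n | mu_k, sigma_k^2)
    weighted = weights * dens                           # numerators of (11.17)
    return weighted / weighted.sum(axis=1, keepdims=True)

R = responsibilities(X, weights, means, variances)
N_k = R.sum(axis=0)                                     # total responsibilities

print(np.round(R, 3))
print(np.round(N_k, 3))
```

Each row of `R` sums to one (a soft assignment of one data point), while the column sums give the total responsibility of each component.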


In the following, we determine the updates of the model parameters 
$\mu_k, \Sigma_k, \pi_k$ for given responsibilities. We will see that the update equations all depend on the responsibilities, which makes a closed-form solution to the maximum likelihood estimation problem impossible. However, 
for given responsibilities we will be updating one model parameter at a 
time, while keeping the others fixed. After this, we will recompute the 
responsibilities. Iterating these two steps will eventually converge to a lo- 
cal optimum and is a specific instantiation of the EM algorithm. We will 
discuss this in some more detail in Section 11.3. 


11.2.2 Updating the Means 
Theorem 11.1 (Update of the GMM Means). The update of the mean parameters $\mu_k$, $k = 1, \dots, K$, of the GMM is given by 

$$\mu_k^{\mathrm{new}} = \frac{\sum_{n=1}^{N} r_{nk} x_n}{\sum_{n=1}^{N} r_{nk}}, \tag{11.20}$$

where the responsibilities $r_{nk}$ are defined in (11.17). 

Remark. The update of the means $\mu_k$ of the individual mixture components in (11.20) depends on all means, covariance matrices $\Sigma_k$, and mixture weights $\pi_k$ via $r_{nk}$ given in (11.17). Therefore, we cannot obtain a 
closed-form solution for all $\mu_k$ at once. ♢ 


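Proof From (11.15), we see that the gradient of the log-likelihood with 
respect to the mean parameters $\mu_k$, $k = 1, \dots, K$, requires us to compute 
the partial derivative 

$$\frac{\partial p(x_n | \theta)}{\partial \mu_k} = \sum_{j=1}^{K} \pi_j \frac{\partial \mathcal{N}(x_n | \mu_j, \Sigma_j)}{\partial \mu_k} = \pi_k \frac{\partial \mathcal{N}(x_n | \mu_k, \Sigma_k)}{\partial \mu_k} \tag{11.21a}$$
$$= \pi_k (x_n - \mu_k)^\top \Sigma_k^{-1} \mathcal{N}(x_n | \mu_k, \Sigma_k), \tag{11.21b}$$

where we exploited that only the $k$th mixture component depends on $\mu_k$. 
We use our result from (11.21b) in (11.15) and put everything together 
so that the desired partial derivative of $\mathcal{L}$ with respect to $\mu_k$ is given as 

$$\frac{\partial \mathcal{L}}{\partial \mu_k} = \sum_{n=1}^{N} \frac{\partial \log p(x_n | \theta)}{\partial \mu_k} = \sum_{n=1}^{N} \frac{1}{p(x_n | \theta)} \frac{\partial p(x_n | \theta)}{\partial \mu_k} \tag{11.22a}$$
$$= \sum_{n=1}^{N} (x_n - \mu_k)^\top \Sigma_k^{-1} \underbrace{\frac{\pi_k \mathcal{N}(x_n | \mu_k, \Sigma_k)}{\sum_{j=1}^{K} \pi_j \mathcal{N}(x_n | \mu_j, \Sigma_j)}}_{= r_{nk}} \tag{11.22b}$$
$$= \sum_{n=1}^{N} r_{nk} (x_n - \mu_k)^\top \Sigma_k^{-1}. \tag{11.22c}$$

Here we used the identity from (11.16) and the result of the partial derivative in (11.21b) to get to (11.22b). The values $r_{nk}$ are the responsibilities 
we defined in (11.17). 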

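We now solve (11.22c) for $\mu_k^{\mathrm{new}}$ so that $\frac{\partial \mathcal{L}(\mu_k^{\mathrm{new}})}{\partial \mu_k} = 0^\top$ and obtain 

$$\sum_{n=1}^{N} r_{nk} x_n = \sum_{n=1}^{N} r_{nk} \mu_k^{\mathrm{new}} \iff \mu_k^{\mathrm{new}} = \frac{\sum_{n=1}^{N} r_{nk} x_n}{\sum_{n=1}^{N} r_{nk}} = \frac{1}{N_k} \sum_{n=1}^{N} r_{nk} x_n, \tag{11.23}$$

where we defined 

$$N_k := \sum_{n=1}^{N} r_{nk} \tag{11.24}$$

as the total responsibility of the $k$th mixture component for the entire 
dataset. This concludes the proof of Theorem 11.1. 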














Intuitively, (11.20) can be interpreted as an importance-weighted Monte 
Carlo estimate of the mean, where the importance weights of data point 
$x_n$ are the responsibilities $r_{nk}$ of the $k$th cluster for $x_n$, $k = 1, \dots, K$. 
Therefore, the mean $\mu_k$ is pulled toward a data point $x_n$ with strength 
given by $r_{nk}$. The means are pulled more strongly toward data points for which 
the corresponding mixture component has a high responsibility, i.e., a high 
likelihood. Figure 11.4 illustrates this. We can also interpret the mean update in (11.20) as the expected value of all data points under the distribution given by 

$$r_k := [r_{1k}, \dots, r_{Nk}]^\top / N_k, \tag{11.25}$$

which is a normalized probability vector, i.e., 

$$\mu_k \leftarrow \mathbb{E}_{r_k}[\mathcal{X}]. \tag{11.26}$$


Draft (2021-01-14) of “Mathematics for Machine Learning”. Feedback: https://mml-book.com. 


Example 11.3 (Mean Updates) 








Figure 11.5 (a) GMM density and individual components prior to updating the mean values; (b) GMM density and individual components after updating the mean values. 

In our example from Figure 11.3, the mean values are updated as follows: 

$$\mu_1 : -4 \to -2.7 \tag{11.27}$$
$$\mu_2 : 0 \to -0.4 \tag{11.28}$$
$$\mu_3 : 8 \to 3.7 \tag{11.29}$$

Here we see that the means of the first and third mixture component 
move toward the regime of the data, whereas the mean of the second 
component does not change so dramatically. Figure 11.5 illustrates this 
change, where Figure 11.5(a) shows the GMM density prior to updating 
the means and Figure 11.5(b) shows the GMM density after updating the 
mean values $\mu_k$. 
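The mean update (11.20) for the running example can be reproduced as a responsibility-weighted average. The sketch below again assumes that the initial components of Example 11.1 are parameterized by variances; all variable names are illustrative.

```python
import numpy as np

X = np.array([-3.0, -2.5, -1.0, 0.0, 2.0, 4.0, 5.0])   # dataset of Example 11.1
weights = np.full(3, 1.0 / 3.0)
means = np.array([-4.0, 0.0, 8.0])
variances = np.array([1.0, 0.2, 3.0])                   # assumed to be variances

# Responsibilities r_nk from (11.17), evaluated at the initial parameters.
diff = X[:, None] - means[None, :]
dens = np.exp(-0.5 * diff ** 2 / variances) / np.sqrt(2.0 * np.pi * variances)
R = weights * dens
R = R / R.sum(axis=1, keepdims=True)

N_k = R.sum(axis=0)                  # total responsibilities, (11.24)
new_means = (R.T @ X) / N_k          # weighted averages of the data, (11.20)

print(np.round(new_means, 1))
```

Each new mean is a weighted average of all data points, so points with a large responsibility for component $k$ pull $\mu_k$ toward themselves most strongly.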


The update of the mean parameters in (11.20) looks fairly straightforward. However, note that the responsibilities $r_{nk}$ are a function of 
$\pi_j, \mu_j, \Sigma_j$ for all $j = 1, \dots, K$, such that the updates in (11.20) depend 
on all parameters of the GMM, and a closed-form solution, which we obtained for linear regression in Section 9.2 or PCA in Chapter 10, cannot 
be obtained. 
11.2.3 Updating the Covariances 


Theorem 11.2 (Updates of the GMM Covariances). The update of the covariance parameters $\Sigma_k$, $k = 1, \dots, K$, of the GMM is given by 

$$\Sigma_k^{\mathrm{new}} = \frac{1}{N_k} \sum_{n=1}^{N} r_{nk} (x_n - \mu_k)(x_n - \mu_k)^\top, \tag{11.30}$$

where $r_{nk}$ and $N_k$ are defined in (11.17) and (11.24), respectively. 


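Proof To prove Theorem 11.2, our approach is to compute the partial 
derivatives of the log-likelihood $\mathcal{L}$ with respect to the covariances $\Sigma_k$, set 
them to $0$, and solve for $\Sigma_k$. We start with our general approach 

$$\frac{\partial \mathcal{L}}{\partial \Sigma_k} = \sum_{n=1}^{N} \frac{\partial \log p(x_n | \theta)}{\partial \Sigma_k} = \sum_{n=1}^{N} \frac{1}{p(x_n | \theta)} \frac{\partial p(x_n | \theta)}{\partial \Sigma_k}. \tag{11.31}$$

We already know $1/p(x_n | \theta)$ from (11.16). To obtain the remaining partial derivative $\partial p(x_n | \theta)/\partial \Sigma_k$, we write down the definition of the Gaussian distribution $p(x_n | \theta)$ (see (11.9)) and drop all terms but the $k$th. We 
then obtain 

$$\frac{\partial p(x_n | \theta)}{\partial \Sigma_k} \tag{11.32a}$$
$$= \frac{\partial}{\partial \Sigma_k} \left( \pi_k (2\pi)^{-\frac{D}{2}} \det(\Sigma_k)^{-\frac{1}{2}} \exp\left( -\tfrac{1}{2} (x_n - \mu_k)^\top \Sigma_k^{-1} (x_n - \mu_k) \right) \right) \tag{11.32b}$$
$$= \pi_k (2\pi)^{-\frac{D}{2}} \left[ \frac{\partial}{\partial \Sigma_k} \det(\Sigma_k)^{-\frac{1}{2}} \exp\left( -\tfrac{1}{2} (x_n - \mu_k)^\top \Sigma_k^{-1} (x_n - \mu_k) \right) + \det(\Sigma_k)^{-\frac{1}{2}} \frac{\partial}{\partial \Sigma_k} \exp\left( -\tfrac{1}{2} (x_n - \mu_k)^\top \Sigma_k^{-1} (x_n - \mu_k) \right) \right]. \tag{11.32c}$$

We now use the identities 

$$\frac{\partial}{\partial \Sigma_k} \det(\Sigma_k)^{-\frac{1}{2}} = -\frac{1}{2} \det(\Sigma_k)^{-\frac{1}{2}} \Sigma_k^{-1}, \tag{11.33}$$
$$\frac{\partial}{\partial \Sigma_k} (x_n - \mu_k)^\top \Sigma_k^{-1} (x_n - \mu_k) \overset{(5.103)}{=} -\Sigma_k^{-1} (x_n - \mu_k)(x_n - \mu_k)^\top \Sigma_k^{-1} \tag{11.34}$$

and obtain (after some rearranging) the desired partial derivative required 
in (11.31) as 

$$\frac{\partial p(x_n | \theta)}{\partial \Sigma_k} = \pi_k \mathcal{N}(x_n | \mu_k, \Sigma_k) \cdot \left[ -\tfrac{1}{2} \left( \Sigma_k^{-1} - \Sigma_k^{-1} (x_n - \mu_k)(x_n - \mu_k)^\top \Sigma_k^{-1} \right) \right]. \tag{11.35}$$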
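Putting everything together, the partial derivative of the log-likelihood 
with respect to $\Sigma_k$ is given by 

$$\frac{\partial \mathcal{L}}{\partial \Sigma_k} = \sum_{n=1}^{N} \frac{\partial \log p(x_n | \theta)}{\partial \Sigma_k} = \sum_{n=1}^{N} \frac{1}{p(x_n | \theta)} \frac{\partial p(x_n | \theta)}{\partial \Sigma_k} \tag{11.36a}$$
$$= \sum_{n=1}^{N} \underbrace{\frac{\pi_k \mathcal{N}(x_n | \mu_k, \Sigma_k)}{\sum_{j=1}^{K} \pi_j \mathcal{N}(x_n | \mu_j, \Sigma_j)}}_{= r_{nk}} \cdot \left[ -\tfrac{1}{2} \left( \Sigma_k^{-1} - \Sigma_k^{-1} (x_n - \mu_k)(x_n - \mu_k)^\top \Sigma_k^{-1} \right) \right] \tag{11.36b}$$
$$= -\frac{1}{2} \sum_{n=1}^{N} r_{nk} \left( \Sigma_k^{-1} - \Sigma_k^{-1} (x_n - \mu_k)(x_n - \mu_k)^\top \Sigma_k^{-1} \right) \tag{11.36c}$$
$$= -\frac{1}{2} \Sigma_k^{-1} \underbrace{\sum_{n=1}^{N} r_{nk}}_{= N_k} + \frac{1}{2} \Sigma_k^{-1} \left( \sum_{n=1}^{N} r_{nk} (x_n - \mu_k)(x_n - \mu_k)^\top \right) \Sigma_k^{-1}. \tag{11.36d}$$

We see that the responsibilities $r_{nk}$ also appear in this partial derivative. 
Setting this partial derivative to $0$, we obtain the necessary optimality 
condition 

$$N_k \Sigma_k^{-1} = \Sigma_k^{-1} \left( \sum_{n=1}^{N} r_{nk} (x_n - \mu_k)(x_n - \mu_k)^\top \right) \Sigma_k^{-1} \tag{11.37a}$$
$$\iff N_k I = \left( \sum_{n=1}^{N} r_{nk} (x_n - \mu_k)(x_n - \mu_k)^\top \right) \Sigma_k^{-1}. \tag{11.37b}$$

By solving for $\Sigma_k$, we obtain 

$$\Sigma_k^{\mathrm{new}} = \frac{1}{N_k} \sum_{n=1}^{N} r_{nk} (x_n - \mu_k)(x_n - \mu_k)^\top, \tag{11.38}$$

where $r_k$ is the probability vector defined in (11.25). This gives us a simple update rule for $\Sigma_k$ for $k = 1, \dots, K$ and proves Theorem 11.2. □ 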


Similar to the update of $\mu_k$ in (11.20), we can interpret the update of 
the covariance in (11.30) as an importance-weighted expected value of 
the square of the centered data $\tilde{\mathcal{X}}_k := \{x_1 - \mu_k, \dots, x_N - \mu_k\}$. 


Example 11.4 (Variance Updates) 
In our example from Figure 11.3, the variances are updated as follows: 

$$\sigma_1^2 : 1 \to 0.14 \tag{11.39}$$
$$\sigma_2^2 : 0.2 \to 0.44 \tag{11.40}$$
$$\sigma_3^2 : 3 \to 1.53 \tag{11.41}$$

Here we see that the variances of the first and third component shrink 
significantly, whereas the variance of the second component increases 
slightly. 

Figure 11.6 illustrates this setting. Figure 11.6(a) is identical (but 
zoomed in) to Figure 11.5(b) and shows the GMM density and its indi- 
vidual components prior to updating the variances. Figure 11.6(b) shows 
the GMM density after updating the variances. 











Figure 11.6 (a) GMM density and individual components prior to updating the variances; (b) GMM density and individual components after updating the variances. 


Similar to the update of the mean parameters, we can interpret (11.30) 
as a Monte Carlo estimate of the weighted covariance of data points $x_n$ 
associated with the $k$th mixture component, where the weights are the 
responsibilities $r_{nk}$. As with the updates of the mean parameters, this update depends on all $\pi_j, \mu_j, \Sigma_j$, $j = 1, \dots, K$, through the responsibilities 
$r_{nk}$, which prohibits a closed-form solution. 
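In the one-dimensional running example, the covariance update (11.30) reduces to a responsibility-weighted variance around the updated means. The sketch below assumes, as the examples suggest, that the responsibilities are those computed at the initial parameters and that the means have already been updated via (11.20); all variable names are illustrative.

```python
import numpy as np

X = np.array([-3.0, -2.5, -1.0, 0.0, 2.0, 4.0, 5.0])   # dataset of Example 11.1
weights = np.full(3, 1.0 / 3.0)
means = np.array([-4.0, 0.0, 8.0])
variances = np.array([1.0, 0.2, 3.0])                   # assumed to be variances

# Responsibilities r_nk from (11.17), evaluated at the initial parameters.
diff = X[:, None] - means[None, :]
dens = np.exp(-0.5 * diff ** 2 / variances) / np.sqrt(2.0 * np.pi * variances)
R = weights * dens
R = R / R.sum(axis=1, keepdims=True)
N_k = R.sum(axis=0)

new_means = (R.T @ X) / N_k                             # mean update (11.20)

# Variance update: weighted squared deviations from the updated means, (11.30).
centered = X[:, None] - new_means[None, :]
new_variances = (R * centered ** 2).sum(axis=0) / N_k

print(np.round(new_variances, 2))
```

For multivariate data, the squared deviation would be replaced by the outer product $(x_n - \mu_k)(x_n - \mu_k)^\top$, giving one covariance matrix per component.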


11.2.4 Updating the Mixture Weights 


Theorem 11.3 (Update of the GMM Mixture Weights). The mixture weights 
of the GMM are updated as 

$$\pi_k^{\mathrm{new}} = \frac{N_k}{N}, \quad k = 1, \dots, K, \tag{11.42}$$

where $N$ is the number of data points and $N_k$ is defined in (11.24). 


Proof To find the partial derivative of the log-likelihood with respect
to the weight parameters $\pi_k$, $k = 1, \ldots, K$, we account for the
constraint $\sum_k \pi_k = 1$ by using Lagrange multipliers (see Section 7.2). The
Lagrangian is
$$\mathfrak{L} = \mathcal{L} + \lambda \left( \sum_{k=1}^{K} \pi_k - 1 \right) \qquad (11.43a)$$
$$= \sum_{n=1}^{N} \log \sum_{k=1}^{K} \pi_k \mathcal{N}(x_n \mid \mu_k, \Sigma_k) + \lambda \left( \sum_{k=1}^{K} \pi_k - 1 \right), \qquad (11.43b)$$
where $\mathcal{L}$ is the log-likelihood from (11.10) and the second term encodes
the equality constraint that all the mixture weights need to sum up to
1. We obtain the partial derivative with respect to $\pi_k$ as








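$$\frac{\partial \mathfrak{L}}{\partial \pi_k} = \sum_{n=1}^{N} \frac{\mathcal{N}(x_n \mid \mu_k, \Sigma_k)}{\sum_{j=1}^{K} \pi_j \mathcal{N}(x_n \mid \mu_j, \Sigma_j)} + \lambda \qquad (11.44a)$$
$$= \frac{1}{\pi_k} \underbrace{\sum_{n=1}^{N} \frac{\pi_k \mathcal{N}(x_n \mid \mu_k, \Sigma_k)}{\sum_{j=1}^{K} \pi_j \mathcal{N}(x_n \mid \mu_j, \Sigma_j)}}_{= N_k} + \lambda = \frac{N_k}{\pi_k} + \lambda, \qquad (11.44b)$$
and the partial derivative with respect to the Lagrange multiplier $\lambda$ as
$$\frac{\partial \mathfrak{L}}{\partial \lambda} = \sum_{k=1}^{K} \pi_k - 1. \qquad (11.45)$$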


Setting both partial derivatives to 0 (necessary condition for optimum)
yields the system of equations
$$\pi_k = -\frac{N_k}{\lambda}, \qquad (11.46)$$
$$1 = \sum_{k=1}^{K} \pi_k. \qquad (11.47)$$
Using (11.46) in (11.47) and solving for $\pi_k$, we obtain
$$\sum_{k=1}^{K} \pi_k = 1 \iff -\sum_{k=1}^{K} \frac{N_k}{\lambda} = 1 \iff -\frac{N}{\lambda} = 1 \iff \lambda = -N. \qquad (11.48)$$
This allows us to substitute $-N$ for $\lambda$ in (11.46) to obtain
$$\pi_k^{\text{new}} = \frac{N_k}{N}, \qquad (11.49)$$
which gives us the update for the weight parameters $\pi_k$ and proves
Theorem 11.3.














We can identify the mixture weight in (11.42) as the ratio of the
total responsibility of the kth cluster and the number of data points. Since
$N = \sum_k N_k$, the number of data points can also be interpreted as the
total responsibility of all mixture components together, such that $\pi_k$ is the
relative importance of the kth mixture component for the dataset.
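
As a small numerical illustration of this ratio, the sketch below computes $N_k$ and the updated weights from a responsibility matrix. The function name `update_mixture_weights` and the responsibility values are made up for this example; only the formula $\pi_k = N_k / N$ comes from (11.42).

```python
import numpy as np

def update_mixture_weights(resp):
    """Mixture-weight update pi_k = N_k / N, cf. (11.42)."""
    N = resp.shape[0]          # number of data points
    N_k = resp.sum(axis=0)     # total responsibility of each component
    return N_k / N

# Hypothetical responsibilities for N = 4 points and K = 3 components.
# Each row sums to 1 because it is a distribution over components.
resp = np.array([
    [0.7, 0.2, 0.1],
    [0.6, 0.3, 0.1],
    [0.1, 0.1, 0.8],
    [0.0, 0.2, 0.8],
])
weights = update_mixture_weights(resp)
```

Because every row of the responsibility matrix sums to 1, the total responsibility over all components equals $N$, so the updated weights automatically sum to 1.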


Remark. Since $N_k = \sum_{n=1}^{N} r_{nk}$, the update equation (11.42) for the
mixture weights $\pi_k$ also depends on all $\pi_j, \mu_j, \Sigma_j$, $j = 1, \ldots, K$, via the
responsibilities $r_{nk}$. ♢

Example 11.5 (Weight Parameter Updates)

Figure 11.7 GMM. (a) GMM density and individual components before updating
the mixture weights; (b) GMM density and individual components after
updating the mixture weights while retaining the means and variances.
Note the different scales of the vertical axes.


In our running example from Figure 11.3, the mixture weights are up- 
dated as follows: 


$\pi_1 : \tfrac{1}{3} \to 0.29$ (11.50)
$\pi_2 : \tfrac{1}{3} \to 0.29$ (11.51)
$\pi_3 : \tfrac{1}{3} \to 0.42$ (11.52)


Here we see that the third component gets more weight/importance, 
while the other components become slightly less important. Figure 11.7 
illustrates the effect of updating the mixture weights. Figure 11.7(a) is 
identical to Figure 11.6(b) and shows the GMM density and its individual 
components prior to updating the mixture weights. Figure 11.7(b) shows 
the GMM density after updating the mixture weights. 

Overall, having updated the means, the variances, and the weights 
once, we obtain the GMM shown in Figure 11.7(b). Compared with the 
initialization shown in Figure 11.3, we can see that the parameter updates 
caused the GMM density to shift some of its mass toward the data points. 

After updating the means, variances, and weights once, the GMM fit
in Figure 11.7(b) is already remarkably better than its initialization from
Figure 11.3. This is also evidenced by the negative log-likelihood, which
decreased from 28.3 (initialization) to 14.4 after one complete update cycle.


11.3 EM Algorithm 


Unfortunately, the updates in (11.20), (11.30), and (11.42) do not
constitute a closed-form solution for the updates of the parameters
$\mu_k, \Sigma_k, \pi_k$ of the mixture model because the responsibilities $r_{nk}$
depend on those parameters in a complex way. However, the results suggest
a simple iterative scheme for finding a solution to the parameter estimation
problem via maximum likelihood. The expectation maximization algorithm
(EM algorithm) was proposed by Dempster et al. (1977) and is a general
iterative scheme for learning parameters (maximum likelihood or MAP) in
mixture models and, more generally, latent-variable models.

In our example of the Gaussian mixture model, we choose initial values
for $\mu_k, \Sigma_k, \pi_k$ and alternate until convergence between

- E-step: Evaluate the responsibilities $r_{nk}$ (posterior probability of data
  point $n$ belonging to mixture component $k$).
- M-step: Use the updated responsibilities to reestimate the parameters
  $\mu_k, \Sigma_k, \pi_k$.

Every step in the EM algorithm increases the log-likelihood function (Neal
and Hinton, 1999). For convergence, we can check the log-likelihood or
the parameters directly. A concrete instantiation of the EM algorithm for
estimating the parameters of a GMM is as follows:


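1. Initialize $\mu_k, \Sigma_k, \pi_k$.
2. E-step: Evaluate responsibilities $r_{nk}$ for every data point $x_n$ using
   current parameters $\pi_k, \mu_k, \Sigma_k$:
   $$r_{nk} = \frac{\pi_k \mathcal{N}(x_n \mid \mu_k, \Sigma_k)}{\sum_j \pi_j \mathcal{N}(x_n \mid \mu_j, \Sigma_j)}. \qquad (11.53)$$
3. M-step: Reestimate parameters $\pi_k, \mu_k, \Sigma_k$ using the current
   responsibilities $r_{nk}$ (from E-step):
   $$\mu_k = \frac{1}{N_k} \sum_{n=1}^{N} r_{nk} x_n, \qquad (11.54)$$
   $$\Sigma_k = \frac{1}{N_k} \sum_{n=1}^{N} r_{nk} (x_n - \mu_k)(x_n - \mu_k)^\top, \qquad (11.55)$$
   $$\pi_k = \frac{N_k}{N}. \qquad (11.56)$$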
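
The three steps above can be sketched for the one-dimensional case as follows. This is a minimal illustration under stated assumptions: the function names, the stopping rule based on the change in log-likelihood, and the synthetic data and initialization are choices made here, not the values from Figure 11.3. The sketch also does not guard against a variance collapsing to zero.

```python
import numpy as np

def gaussian_pdf(x, mean, var):
    """Univariate Gaussian density N(x | mean, var), broadcast over arrays."""
    return np.exp(-0.5 * (x - mean) ** 2 / var) / np.sqrt(2.0 * np.pi * var)

def log_likelihood(x, weights, means, variances):
    """Log-likelihood of a 1D GMM: sum_n log sum_k pi_k N(x_n | mu_k, var_k)."""
    mixture = (weights * gaussian_pdf(x[:, None], means, variances)).sum(axis=1)
    return np.log(mixture).sum()

def em_gmm_1d(x, weights, means, variances, max_iter=200, tol=1e-8):
    """Fit a 1D GMM by EM; returns the parameters and the log-likelihood trace."""
    weights, means, variances = (np.asarray(a, dtype=float).copy()
                                 for a in (weights, means, variances))
    trace = [log_likelihood(x, weights, means, variances)]
    for _ in range(max_iter):
        # E-step (11.53): responsibilities r_nk, shape (N, K).
        joint = weights * gaussian_pdf(x[:, None], means, variances)
        resp = joint / joint.sum(axis=1, keepdims=True)
        # M-step (11.54)-(11.56): means first, then variances with the new means.
        N_k = resp.sum(axis=0)
        means = (resp * x[:, None]).sum(axis=0) / N_k
        variances = (resp * (x[:, None] - means) ** 2).sum(axis=0) / N_k
        weights = N_k / x.size
        trace.append(log_likelihood(x, weights, means, variances))
        if trace[-1] - trace[-2] < tol:   # stop once the improvement is negligible
            break
    return weights, means, variances, np.array(trace)

# Hypothetical data and initialization, chosen only for illustration.
rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(-3.0, 0.5, 40), rng.normal(2.0, 1.0, 60)])
weights, means, variances, trace = em_gmm_1d(
    x, weights=[0.5, 0.5], means=[-1.0, 1.0], variances=[1.0, 1.0])
```

Plotting `-trace` against the iteration index gives a curve of the same kind as the negative log-likelihood plots used in this section.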


Having updated the means $\mu_k$ in (11.54), they are subsequently used
in (11.55) to update the corresponding covariances.

Example 11.6 (GMM Fit)

Figure 11.8 EM algorithm applied to the GMM from Figure 11.2. (a) Final
GMM fit. After five iterations, the EM algorithm converges and returns this
GMM; (b) negative log-likelihood as a function of the EM iterations.

Figure 11.9 Illustration of the EM algorithm for fitting a Gaussian
mixture model with three components to a two-dimensional dataset.
(a) Dataset; (b) negative log-likelihood (lower is better) as a function of
the EM iterations. The red dots indicate the iterations for which the
mixture components of the corresponding GMM fits are shown in (c)
through (f), where (e) shows EM after 10 iterations and (f) shows EM after
62 iterations. The yellow discs indicate the means of the Gaussian mixture
components. Figure 11.10(a) shows the final GMM fit.



When we run EM on our example from Figure 11.3, we obtain the final 
result shown in Figure 11.8(a) after five iterations, and Figure 11.8(b) 
shows how the negative log-likelihood evolves as a function of the EM 
iterations. The final GMM is given as 


$$p(x) = 0.29\,\mathcal{N}(x \mid -2.75, 0.06) + 0.28\,\mathcal{N}(x \mid -0.50, 0.25) + 0.43\,\mathcal{N}(x \mid 3.64, 1.63). \qquad (11.57)$$


We applied the EM algorithm to the two-dimensional dataset shown
in Figure 11.1 with $K = 3$ mixture components. Figure 11.9 illustrates
some steps of the EM algorithm and shows the negative log-likelihood as
a function of the EM iteration (Figure 11.9(b)). Figure 11.10(a) shows
the corresponding final GMM fit. Figure 11.10(b) visualizes the final
responsibilities of the mixture components for the data points. The dataset is
colored according to the responsibilities of the mixture components when
EM converges. While a single mixture component is clearly responsible
for the data on the left, the overlap of the two data clusters on the right
could have been generated by two mixture components. It becomes clear
that there are data points that cannot be uniquely assigned to a single
component (either blue or yellow), such that the responsibilities of these
two clusters for those points are around 0.5.


11.4 Latent-Variable Perspective 


We can look at the GMM from the perspective of a discrete latent-variable
model, i.e., where the latent variable $z$ can attain only a finite set of
values. This is in contrast to PCA, where the latent variables were
continuous-valued numbers in $\mathbb{R}^M$.

The advantages of the probabilistic perspective are that (i) it will jus- 
tify some ad hoc decisions we made in the previous sections, (ii) it allows 
for a concrete interpretation of the responsibilities as posterior probabil- 
ities, and (iii) the iterative algorithm for updating the model parameters 
can be derived in a principled manner as the EM algorithm for maximum 
likelihood parameter estimation in latent-variable models. 


11.4.1 Generative Process and Probabilistic Model 


To derive the probabilistic model for GMMs, it is useful to think about the 
generative process, i.e., the process that allows us to generate data, using 
a probabilistic model. 

We assume a mixture model with $K$ components and that a data point
$x$ can be generated by exactly one mixture component. We introduce a
binary indicator variable $z_k \in \{0, 1\}$ with two states (see Section 6.2) that
indicates whether the kth mixture component generated that data point
so that
$$p(x \mid z_k = 1) = \mathcal{N}(x \mid \mu_k, \Sigma_k). \qquad (11.58)$$

We define $z := [z_1, \ldots, z_K]^\top \in \mathbb{R}^K$ as a probability vector consisting of
$K - 1$ many 0s and exactly one 1. For example, for $K = 3$, a valid $z$ would
be $z = [z_1, z_2, z_3]^\top = [0, 1, 0]^\top$, which would select the second mixture
component since $z_2 = 1$.

Remark. Sometimes this kind of probability distribution is called
“multinoulli”, a generalization of the Bernoulli distribution to more than two
values (Murphy, 2012). ♢

The properties of $z$ imply that $\sum_{k=1}^{K} z_k = 1$. Therefore, $z$ is a one-hot
encoding (also: 1-of-K representation).

Figure 11.10 GMM fit and responsibilities when EM converges. (a) GMM fit
when EM converges; (b) each data point is colored according to the
responsibilities of the mixture components.

Figure 11.11 Graphical model for a GMM with a single data point.

Thus far, we assumed that the indicator variables $z_k$ are known.
However, in practice, this is not the case, and we place a prior distribution
$$p(z) = \pi = [\pi_1, \ldots, \pi_K]^\top, \quad \sum_{k=1}^{K} \pi_k = 1, \qquad (11.59)$$
on the latent variable $z$. Then the kth entry
$$\pi_k = p(z_k = 1) \qquad (11.60)$$
of this probability vector describes the probability that the kth mixture
component generated data point $x$.


Remark (Sampling from a GMM). The construction of this latent-variable
model (see the corresponding graphical model in Figure 11.11) lends
itself to a very simple sampling procedure (generative process) to generate
data:

1. Sample $z^{(i)} \sim p(z)$.
2. Sample $x^{(i)} \sim p(x \mid z^{(i)} = 1)$.

In the first step, we select a mixture component $i$ (via the one-hot
encoding $z$) at random according to $p(z) = \pi$; in the second step we draw a
sample from the corresponding mixture component. When we discard the
samples of the latent variable so that we are left with the $x^{(i)}$, we have
valid samples from the GMM. This kind of sampling, where samples of
random variables depend on samples from the variable’s parents in the
graphical model, is called ancestral sampling. ♢
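
The two sampling steps can be sketched as follows. This is an illustrative sketch only: the function name `sample_gmm` and all parameter values are hypothetical, and the component index is stored in place of the full one-hot vector $z$.

```python
import numpy as np

def sample_gmm(n_samples, weights, means, covariances, rng):
    """Ancestral sampling: draw the component index first, then the data point."""
    K = len(weights)
    # Step 1: sample z ~ p(z) = pi. We keep only the index of the nonzero
    # entry of the one-hot vector z.
    components = rng.choice(K, size=n_samples, p=weights)
    # Step 2: sample x from the Gaussian of the selected component.
    samples = np.stack([
        rng.multivariate_normal(means[k], covariances[k]) for k in components
    ])
    return samples, components

# Hypothetical GMM parameters with K = 3 components in two dimensions.
rng = np.random.default_rng(42)
weights = np.array([0.2, 0.5, 0.3])
means = np.array([[-4.0, 0.0], [0.0, 3.0], [5.0, -1.0]])
covariances = np.stack([np.eye(2) * s for s in (0.5, 1.0, 0.3)])
X, z = sample_gmm(1000, weights, means, covariances, rng)
```

Discarding `z` and keeping only `X` yields samples from the GMM itself; the empirical frequencies of the indices in `z` should be close to the mixture weights.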


Generally, a probabilistic model is defined by the joint distribution of
the data and the latent variables (see Section 8.4). With the prior $p(z)$
defined in (11.59) and (11.60) and the conditional $p(x \mid z)$ from (11.58),
we obtain all $K$ components of this joint distribution via
$$p(x, z_k = 1) = p(x \mid z_k = 1)\, p(z_k = 1) = \pi_k \mathcal{N}(x \mid \mu_k, \Sigma_k) \qquad (11.61)$$
for $k = 1, \ldots, K$, so that
$$p(x, z) = \begin{bmatrix} p(x, z_1 = 1) \\ \vdots \\ p(x, z_K = 1) \end{bmatrix} = \begin{bmatrix} \pi_1 \mathcal{N}(x \mid \mu_1, \Sigma_1) \\ \vdots \\ \pi_K \mathcal{N}(x \mid \mu_K, \Sigma_K) \end{bmatrix}, \qquad (11.62)$$
which fully specifies the probabilistic model.


11.4.2 Likelihood 


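To obtain the likelihood $p(x \mid \theta)$ in a latent-variable model, we need to
marginalize out the latent variables (see Section 8.4.3). In our case, this
can be done by summing out all latent variables from the joint $p(x, z)$
in (11.62) so that
$$p(x \mid \theta) = \sum_{z} p(x \mid \theta, z)\, p(z \mid \theta), \quad \theta := \{\mu_k, \Sigma_k, \pi_k : k = 1, \ldots, K\}. \qquad (11.63)$$

We now explicitly condition on the parameters $\theta$ of the probabilistic model,
which we previously omitted. In (11.63), we sum over all $K$ possible
one-hot encodings of $z$, which is denoted by $\sum_z$. Since there is only a single
nonzero entry in each $z$, there are only $K$ possible configurations/settings
of $z$. For example, if $K = 3$, then $z$ can have the configurations
$$\begin{bmatrix} 1 \\ 0 \\ 0 \end{bmatrix}, \quad \begin{bmatrix} 0 \\ 1 \\ 0 \end{bmatrix}, \quad \begin{bmatrix} 0 \\ 0 \\ 1 \end{bmatrix}. \qquad (11.64)$$

Summing over all possible configurations of $z$ in (11.63) is equivalent to
looking at the nonzero entry of the $z$-vector and writing
$$p(x \mid \theta) = \sum_{z} p(x \mid \theta, z)\, p(z \mid \theta) \qquad (11.65a)$$
$$= \sum_{k=1}^{K} p(x \mid \theta, z_k = 1)\, p(z_k = 1 \mid \theta) \qquad (11.65b)$$
so that the desired marginal distribution is given as
$$p(x \mid \theta) \overset{(11.65b)}{=} \sum_{k=1}^{K} p(x \mid \theta, z_k = 1)\, p(z_k = 1 \mid \theta) \qquad (11.66a)$$
$$= \sum_{k=1}^{K} \pi_k \mathcal{N}(x \mid \mu_k, \Sigma_k), \qquad (11.66b)$$
which we identify as the GMM model from (11.3). Given a dataset $\mathcal{X}$, we
immediately obtain the likelihood
$$p(\mathcal{X} \mid \theta) = \prod_{n=1}^{N} p(x_n \mid \theta) \overset{(11.66b)}{=} \prod_{n=1}^{N} \sum_{k=1}^{K} \pi_k \mathcal{N}(x_n \mid \mu_k, \Sigma_k), \qquad (11.67)$$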




which is exactly the GMM likelihood from (11.9). Therefore, the
latent-variable model with latent indicators $z_k$ is an equivalent way of thinking
about a Gaussian mixture model.

Figure 11.12 Graphical model for a GMM with N data points.


11.4.3 Posterior Distribution 


Let us have a brief look at the posterior distribution on the latent variable
$z$. According to Bayes’ theorem, the posterior of the kth component having
generated data point $x$ is
$$p(z_k = 1 \mid x) = \frac{p(z_k = 1)\, p(x \mid z_k = 1)}{p(x)}, \qquad (11.68)$$
where the marginal $p(x)$ is given in (11.66b). This yields the posterior
distribution for the kth indicator variable $z_k$
$$p(z_k = 1 \mid x) = \frac{p(z_k = 1)\, p(x \mid z_k = 1)}{\sum_{j=1}^{K} p(z_j = 1)\, p(x \mid z_j = 1)} = \frac{\pi_k \mathcal{N}(x \mid \mu_k, \Sigma_k)}{\sum_{j=1}^{K} \pi_j \mathcal{N}(x \mid \mu_j, \Sigma_j)}, \qquad (11.69)$$
which we identify as the responsibility of the kth mixture component for
data point $x$. Note that we omitted the explicit conditioning on the GMM
parameters $\pi_k, \mu_k, \Sigma_k$, where $k = 1, \ldots, K$.


11.4.4 Extension to a Full Dataset 


Thus far, we have only discussed the case where the dataset consists only
of a single data point $x$. However, the concepts of the prior and posterior
can be directly extended to the case of $N$ data points $\mathcal{X} := \{x_1, \ldots, x_N\}$.

In the probabilistic interpretation of the GMM, every data point $x_n$
possesses its own latent variable
$$z_n = [z_{n1}, \ldots, z_{nK}]^\top \in \mathbb{R}^K. \qquad (11.70)$$
Previously (when we only considered a single data point $x$), we omitted
the index $n$, but now this becomes important.

We share the same prior distribution $\pi$ across all latent variables $z_n$.
The corresponding graphical model is shown in Figure 11.12, where we
use the plate notation.

The conditional distribution $p(x_1, \ldots, x_N \mid z_1, \ldots, z_N)$ factorizes over
the data points and is given as
$$p(x_1, \ldots, x_N \mid z_1, \ldots, z_N) = \prod_{n=1}^{N} p(x_n \mid z_n). \qquad (11.71)$$

To obtain the posterior distribution $p(z_{nk} = 1 \mid x_n)$, we follow the same
reasoning as in Section 11.4.3 and apply Bayes’ theorem to obtain
$$p(z_{nk} = 1 \mid x_n) = \frac{p(x_n \mid z_{nk} = 1)\, p(z_{nk} = 1)}{\sum_{j=1}^{K} p(x_n \mid z_{nj} = 1)\, p(z_{nj} = 1)} \qquad (11.72a)$$
$$= \frac{\pi_k \mathcal{N}(x_n \mid \mu_k, \Sigma_k)}{\sum_{j=1}^{K} \pi_j \mathcal{N}(x_n \mid \mu_j, \Sigma_j)} = r_{nk}. \qquad (11.72b)$$

This means that $p(z_{nk} = 1 \mid x_n)$ is the (posterior) probability that the kth
mixture component generated data point $x_n$ and corresponds to the
responsibility $r_{nk}$ we introduced in (11.17). The responsibilities now have
not only an intuitive but also a mathematically justified interpretation
as posterior probabilities.


11.4.5 EM Algorithm Revisited 


The EM algorithm that we introduced as an iterative scheme for maximum
likelihood estimation can be derived in a principled way from the
latent-variable perspective. Given a current setting $\theta^{(t)}$ of model parameters, the
E-step calculates the expected log-likelihood
$$Q(\theta \mid \theta^{(t)}) = \mathbb{E}_{z \mid x, \theta^{(t)}}[\log p(x, z \mid \theta)] \qquad (11.73a)$$
$$= \int \log p(x, z \mid \theta)\, p(z \mid x, \theta^{(t)})\, dz, \qquad (11.73b)$$
where the expectation of $\log p(x, z \mid \theta)$ is taken with respect to the
posterior $p(z \mid x, \theta^{(t)})$ of the latent variables. The M-step selects an updated set
of model parameters $\theta^{(t+1)}$ by maximizing (11.73b).

Although an EM iteration does increase the log-likelihood, there are
no guarantees that EM converges to the maximum likelihood solution.
It is possible that the EM algorithm converges to a local maximum of
the log-likelihood. Different initializations of the parameters $\theta$ could be
used in multiple EM runs to reduce the risk of ending up in a bad local
optimum. We do not go into further details here, but refer to the excellent
expositions by Rogers and Girolami (2016) and Bishop (2006).

The GMM can be considered a generative model in the sense that it is
straightforward to generate new data using ancestral sampling (Bishop,
2006). For given GMM parameters $\pi_k, \mu_k, \Sigma_k$, $k = 1, \ldots, K$, we sample
an index $k$ from the probability vector $[\pi_1, \ldots, \pi_K]^\top$ and then sample a
data point $x \sim \mathcal{N}(\mu_k, \Sigma_k)$. If we repeat this $N$ times, we obtain a dataset
that has been generated by a GMM. Figure 11.1 was generated using this
procedure.

Throughout this chapter, we assumed that the number of components 
K is known. In practice, this is often not the case. However, we could use 
nested cross-validation, as discussed in Section 8.6.1, to find good models. 

Gaussian mixture models are closely related to the K-means clustering
algorithm. K-means also uses the EM algorithm to assign data points to
clusters. If we treat the means in the GMM as cluster centers and ignore
the covariances (or set them to $I$), we arrive at K-means. As also nicely
described by MacKay (2003), K-means makes a “hard” assignment of data
points to cluster centers $\mu_k$, whereas a GMM makes a “soft” assignment
via the responsibilities.
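
The difference between the two kinds of assignment can be made concrete with a small sketch. The helper names, the shared isotropic covariance `var * I`, and the toy points below are assumptions made for illustration; this is not a full K-means or GMM implementation.

```python
import numpy as np

def soft_assignments(X, weights, means, var=1.0):
    """GMM responsibilities when all components share the covariance var * I."""
    sq_dist = ((X[:, None, :] - means[None, :, :]) ** 2).sum(axis=2)   # (N, K)
    # The Gaussian normalization constants cancel because the covariances are equal.
    log_joint = np.log(weights) - 0.5 * sq_dist / var
    log_joint -= log_joint.max(axis=1, keepdims=True)   # numerical stability
    joint = np.exp(log_joint)
    return joint / joint.sum(axis=1, keepdims=True)

def hard_assignments(X, means):
    """K-means style assignment: index of the nearest cluster center."""
    sq_dist = ((X[:, None, :] - means[None, :, :]) ** 2).sum(axis=2)
    return sq_dist.argmin(axis=1)

# Hypothetical points and cluster centers; the middle point is equidistant.
X = np.array([[0.0, 0.0], [1.0, 0.0], [4.0, 0.0]])
means = np.array([[0.0, 0.0], [2.0, 0.0]])
weights = np.array([0.5, 0.5])
resp = soft_assignments(X, weights, means)
labels = hard_assignments(X, means)
```

For the equidistant point, the soft assignment splits the responsibility evenly, whereas the hard assignment must pick one center. Shrinking `var` toward zero pushes the responsibilities of the other points toward one-hot vectors that agree with the hard labels.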

We only touched upon the latent-variable perspective of GMMs and the 
EM algorithm. Note that EM can be used for parameter learning in general 
latent-variable models, e.g., nonlinear state-space models (Ghahramani 
and Roweis, 1999; Roweis and Ghahramani, 1999) and for reinforcement 
learning as discussed by Barber (2012). Therefore, the latent-variable per- 
spective of a GMM is useful to derive the corresponding EM algorithm in 
a principled way (Bishop, 2006; Barber, 2012; Murphy, 2012). 

We only discussed maximum likelihood estimation (via the EM algo- 
rithm) for finding GMM parameters. The standard criticisms of maximum 
likelihood also apply here: 


- As in linear regression, maximum likelihood can suffer from severe
  overfitting. In the GMM case, this happens when the mean of a mixture
  component is identical to a data point and the covariance tends to
  0. Then, the likelihood approaches infinity. Bishop (2006) and Barber
  (2012) discuss this issue in detail.
- We only obtain a point estimate of the parameters $\pi_k, \mu_k, \Sigma_k$ for
  $k = 1, \ldots, K$, which does not give any indication of uncertainty in the
  parameter values. A Bayesian approach would place a prior on the
  parameters, which can be used to obtain a posterior distribution on the
  parameters. This posterior allows us to compute the model evidence
  (marginal likelihood), which can be used for model comparison and thus
  gives us a principled way to determine the number of mixture components.
  Unfortunately, closed-form inference is not possible in this setting because
  there is no conjugate prior for this model. However, approximations,
  such as variational inference, can be used to obtain an approximate
  posterior (Bishop, 2006).

In this chapter, we discussed mixture models for density estimation.
There is a plethora of density estimation techniques available. In practice,
we often use histograms and kernel density estimation.

Histograms provide a nonparametric way to represent continuous
densities and have been proposed by Pearson (1895). A histogram is
constructed by “binning” the data space and counting how many data points fall
into each bin. Then a bar is drawn at the center of each bin, and the height
of the bar is proportional to the number of data points within that bin. The
bin size is a critical hyperparameter, and a bad choice can lead to
overfitting and underfitting. Cross-validation, as discussed in Section 8.2.4, can
be used to determine a good bin size.

Kernel density estimation, independently proposed by Rosenblatt (1956)
and Parzen (1962), is a nonparametric method for density estimation. Given
$N$ i.i.d. samples, the kernel density estimator represents the underlying
distribution as
$$p(x) = \frac{1}{Nh} \sum_{n=1}^{N} k\left( \frac{x - x_n}{h} \right), \qquad (11.74)$$
where $k$ is a kernel function, i.e., a nonnegative function that integrates to
1, and $h > 0$ is a smoothing/bandwidth parameter, which plays a similar
role as the bin size in histograms. Note that we place a kernel on every
single data point $x_n$ in the dataset. Commonly used kernel functions are
the uniform distribution and the Gaussian distribution. Kernel density
estimates are closely related to histograms, but by choosing a suitable kernel,
we can guarantee smoothness of the density estimate. Figure 11.13
illustrates the difference between a histogram and a kernel density estimator
(with a Gaussian-shaped kernel) for a given dataset of 250 data points.
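
A kernel density estimator with a Gaussian kernel takes only a few lines. The sketch below is illustrative: the function names, the synthetic dataset, the bandwidth `h = 0.4`, and the number of histogram bins are arbitrary choices made here, not the settings behind Figure 11.13.

```python
import numpy as np

def gaussian_kernel(u):
    """Standard normal density: nonnegative and integrates to 1."""
    return np.exp(-0.5 * u ** 2) / np.sqrt(2.0 * np.pi)

def kde(x_query, data, h):
    """Kernel density estimate (11.74) evaluated at each point of x_query."""
    u = (x_query[:, None] - data[None, :]) / h     # (Q, N) scaled differences
    return gaussian_kernel(u).sum(axis=1) / (data.size * h)

rng = np.random.default_rng(1)
data = rng.normal(0.0, 1.0, 250)          # hypothetical dataset of 250 points
grid = np.linspace(-8.0, 8.0, 1601)
density = kde(grid, data, h=0.4)          # bandwidth picked by hand

# Histogram-based density estimate on the same data for comparison.
counts, edges = np.histogram(data, bins=20, density=True)
```

Both estimates are normalized densities: the histogram bar areas and the area under the KDE curve each sum to (approximately) 1. A larger `h` gives a smoother curve, in the same way that wider bins give a coarser histogram.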
Figure 11.13 Histogram (orange bars) and kernel density estimation
(blue line). The kernel density estimator produces a smooth estimate
of the underlying density, whereas the histogram is an unsmoothed count
measure of how many data points (black) fall into a single bin.

Classification with Support Vector Machines

An example of structure is if the outcomes were ordered, as in the
case of small, medium, and large t-shirts. Input example $x_n$ may also be
referred to as inputs, data points, features, or instances. For probabilistic
models, it is mathematically convenient to use {0, 1} as a binary
representation; see the remark after Example 6.12.



In many situations, we want our machine learning algorithm to predict 
one of a number of (discrete) outcomes. For example, an email client sorts 
mail into personal mail and junk mail, which has two outcomes. Another 
example is a telescope that identifies whether an object in the night sky 
is a galaxy, star, or planet. There are usually a small number of outcomes, 
and more importantly there is usually no additional structure on these 
outcomes. In this chapter, we consider predictors that output binary val- 
ues, i.e., there are only two possible outcomes. This machine learning task 
is called binary classification. This is in contrast to Chapter 9, where we 
considered a prediction problem with continuous-valued outputs. 

For binary classification, the set of possible values that the label/output
can attain is binary, and for this chapter we denote them by $\{+1, -1\}$. In
other words, we consider predictors of the form
$$f : \mathbb{R}^D \to \{+1, -1\}. \qquad (12.1)$$

Recall from Chapter 8 that we represent each example (data point) $x_n$
as a feature vector of $D$ real numbers. The labels are often referred to as
the positive and negative classes, respectively. One should be careful not
to infer intuitive attributes of positiveness of the $+1$ class. For example,
in a cancer detection task, a patient with cancer is often labeled $+1$. In
principle, any two distinct values can be used, e.g., {True, False}, {0, 1}
or {red, blue}. The problem of binary classification is well studied, and
we defer a survey of other approaches to Section 12.6.

We present an approach known as the support vector machine (SVM),
which solves the binary classification task. As in regression, we have a
supervised learning task, where we have a set of examples $x_n \in \mathbb{R}^D$ along
with their corresponding (binary) labels $y_n \in \{+1, -1\}$. Given a
training data set consisting of example-label pairs $\{(x_1, y_1), \ldots, (x_N, y_N)\}$, we
would like to estimate parameters of the model that will give the smallest
classification error. Similar to Chapter 9, we consider a linear model, and
hide away the nonlinearity in a transformation $\phi$ of the examples (9.13).
We will revisit $\phi$ in Section 12.4.

The SVM provides state-of-the-art results in many applications, with
sound theoretical guarantees (Steinwart and Christmann, 2008). There
are two main reasons why we chose to illustrate binary classification using
SVMs. First, the SVM allows for a geometric way to think about supervised
machine learning. While in Chapter 9 we considered the machine learning 
problem in terms of probabilistic models and attacked it using maximum 
likelihood estimation and Bayesian inference, here we will consider an 
alternative approach where we reason geometrically about the machine 
learning task. It relies heavily on concepts, such as inner products and 
projections, which we discussed in Chapter 3. The second reason why we 
find SVMs instructive is that in contrast to Chapter 9, the optimization 
problem for SVM does not admit an analytic solution so that we need to 
resort to a variety of optimization tools introduced in Chapter 7. 

The SVM view of machine learning is subtly different from the max- 
imum likelihood view of Chapter 9. The maximum likelihood view pro- 
poses a model based on a probabilistic view of the data distribution, from 
which an optimization problem is derived. In contrast, the SVM view starts 
by designing a particular function that is to be optimized during training, 
based on geometric intuitions. We have seen something similar already 
in Chapter 10, where we derived PCA from geometric principles. In the 
SVM case, we start by designing a loss function that is to be minimized 
on training data, following the principles of empirical risk minimization 
(Section 8.2). 

Let us derive the optimization problem corresponding to training an
SVM on example-label pairs. Intuitively, we imagine binary classification
data, which can be separated by a hyperplane as illustrated in Figure 12.1.
Here, every example $x_n$ (a vector of dimension 2) is a two-dimensional
location ($x_n^{(1)}$ and $x_n^{(2)}$), and the corresponding binary label $y_n$ is one of
two different symbols (orange cross or blue disc). “Hyperplane” is a word
that is commonly used in machine learning, and we encountered
hyperplanes already in Section 2.8. A hyperplane is an affine subspace of
dimension $D - 1$ (if the corresponding vector space is of dimension $D$).
The examples consist of two classes (there are two possible labels) that
have features (the components of the vector representing the example)
arranged in such a way as to allow us to separate/classify them by
drawing a straight line.

Figure 12.1 Example 2D data, illustrating the intuition of data where we
can find a linear classifier that separates orange crosses from blue discs.

In the following, we formalize the idea of finding a linear separator
of the two classes. We introduce the idea of the margin and then extend 
linear separators to allow for examples to fall on the “wrong” side, incur- 
ring a classification error. We present two equivalent ways of formalizing 
the SVM: the geometric view (Section 12.2.4) and the loss function view 
(Section 12.2.5). We derive the dual version of the SVM using Lagrange 
multipliers (Section 7.2). The dual SVM allows us to observe a third way 
of formalizing the SVM: in terms of the convex hulls of the examples of 
each class (Section 12.3.2). We conclude by briefly describing kernels and 
how to numerically solve the nonlinear kernel-SVM optimization problem. 


12.1 Separating Hyperplanes 


Given two examples represented as vectors $x_i$ and $x_j$, one way to compute
the similarity between them is using an inner product $\langle x_i, x_j \rangle$. Recall from
Section 3.2 that inner products are closely related to the angle between
two vectors. The value of the inner product between two vectors depends
on the length (norm) of each vector. Furthermore, inner products allow
us to rigorously define geometric concepts such as orthogonality and
projections.

The main idea behind many classification algorithms is to represent
data in $\mathbb{R}^D$ and then partition this space, ideally in a way that examples
with the same label (and no other examples) are in the same partition.
In the case of binary classification, the space would be divided into two
parts corresponding to the positive and negative classes, respectively. We
consider a particularly convenient partition, which is to (linearly) split
the space into two halves using a hyperplane. Let example $x \in \mathbb{R}^D$ be an
element of the data space. Consider a function
$$f : \mathbb{R}^D \to \mathbb{R} \qquad (12.2a)$$
$$x \mapsto f(x) := \langle w, x \rangle + b, \qquad (12.2b)$$
parametrized by $w \in \mathbb{R}^D$ and $b \in \mathbb{R}$. Recall from Section 2.8 that
hyperplanes are affine subspaces. Therefore, we define the hyperplane that
separates the two classes in our binary classification problem as
$$\{x \in \mathbb{R}^D : f(x) = 0\}. \qquad (12.3)$$


An illustration of the hyperplane is shown in Figure 12.2, where the
vector $w$ is a vector normal to the hyperplane and $b$ the intercept. We can
derive that $w$ is a normal vector to the hyperplane in (12.3) by choosing
any two examples $x_a$ and $x_b$ on the hyperplane and showing that the
vector between them is orthogonal to $w$. In the form of an equation,
$$f(x_a) - f(x_b) = \langle w, x_a \rangle + b - (\langle w, x_b \rangle + b) \qquad (12.4a)$$
$$= \langle w, x_a - x_b \rangle, \qquad (12.4b)$$
where the second line is obtained by the linearity of the inner product
(Section 3.2). Since we have chosen $x_a$ and $x_b$ to be on the hyperplane,
this implies that $f(x_a) = 0$ and $f(x_b) = 0$ and hence $\langle w, x_a - x_b \rangle = 0$.
Recall that two vectors are orthogonal when their inner product is zero.
Therefore, we obtain that $w$ is orthogonal to any vector on the hyperplane.

Figure 12.2 panels: (a) separating hyperplane in 3D; (b) projection of the
setting in (a) onto a plane.


Remark. Recall from Chapter 2 that we can think of vectors in different
ways. In this chapter, we think of the parameter vector $w$ as an arrow
indicating a direction, i.e., we consider $w$ to be a geometric vector. In
contrast, we think of the example vector $x$ as a data point (as indicated
by its coordinates), i.e., we consider $x$ to be the coordinates of a vector
with respect to the standard basis. ♢


When presented with a test example, we classify the example as
positive or negative depending on the side of the hyperplane on which it
occurs. Note that (12.3) not only defines a hyperplane; it additionally
defines a direction. In other words, it defines the positive and negative side
of the hyperplane. Therefore, to classify a test example $x_{\text{test}}$, we
calculate the value of the function $f(x_{\text{test}})$ and classify the example as $+1$ if
$f(x_{\text{test}}) \geq 0$ and $-1$ otherwise. Thinking geometrically, the positive
examples lie “above” the hyperplane and the negative examples “below” the
hyperplane.
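
The classification rule can be sketched directly from (12.2b). The values of `w`, `b`, and the test points below are hypothetical, and assigning points that lie exactly on the hyperplane to the class $+1$ is a convention chosen for this sketch.

```python
import numpy as np

def f(X, w, b):
    """Affine function f(x) = <w, x> + b from (12.2b), applied to each row of X."""
    return X @ w + b

def predict(X, w, b):
    """Label +1 on the nonnegative side of the hyperplane and -1 otherwise."""
    return np.where(f(X, w, b) >= 0, 1, -1)

w = np.array([1.0, -2.0])    # hypothetical normal vector of the hyperplane
b = 0.5                      # hypothetical intercept
X_test = np.array([[3.0, 1.0], [0.0, 2.0], [-0.5, 0.0]])
labels = predict(X_test, w, b)
```

The third test point lies exactly on the hyperplane, since $f(x) = 0$ there. Flipping the sign of both `w` and `b` leaves the hyperplane unchanged but swaps its positive and negative sides, which is the sense in which (12.3) also defines a direction.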

When training the classifier, we want to ensure that the examples with
positive labels are on the positive side of the hyperplane, i.e.,

$$\langle w, x_n \rangle + b \geq 0 \quad \text{when} \quad y_n = +1 \tag{12.5}$$

and the examples with negative labels are on the negative side, i.e.,

$$\langle w, x_n \rangle + b < 0 \quad \text{when} \quad y_n = -1 . \tag{12.6}$$

Refer to Figure 12.2 for a geometric intuition of positive and negative
examples. These two conditions are often presented in a single equation

$$y_n(\langle w, x_n \rangle + b) \geq 0 . \tag{12.7}$$

Equation (12.7) is equivalent to (12.5) and (12.6) when we multiply both
sides of (12.5) and (12.6) with $y_n = 1$ and $y_n = -1$, respectively.
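
The classification rule and the combined training condition (12.7) can be illustrated with a minimal sketch. The toy data and the parameter values `w`, `b` below are hypothetical and chosen only for illustration, and the standard dot product is assumed as the inner product.

```python
def inner(u, v):
    """Standard dot product, used here as the inner product <u, v>."""
    return sum(ui * vi for ui, vi in zip(u, v))


def f(x, w, b):
    """Hyperplane function f(x) = <w, x> + b from (12.3)."""
    return inner(w, x) + b


def classify(x, w, b):
    """Predict +1 if f(x) >= 0 and -1 otherwise."""
    return 1 if f(x, w, b) >= 0 else -1


def separates(X, y, w, b):
    """Check the combined condition (12.7): y_n * f(x_n) >= 0 for all n."""
    return all(yn * f(xn, w, b) >= 0 for xn, yn in zip(X, y))


# Hypothetical toy data and parameters (assumptions for illustration).
X = [[2.0, 1.0], [3.0, 0.0], [0.0, 0.0], [-1.0, 1.0]]
y = [1, 1, -1, -1]
w, b = [1.0, 0.0], -1.0

predictions = [classify(xn, w, b) for xn in X]
```

Flipping the sign of both `w` and `b` describes the same set of points with $f(x) = 0$ but swaps the positive and negative sides, so `separates` then returns `False` for this data.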
Figure 12.2 Equation of a separating hyperplane (12.3). (a) The standard way of representing the equation in 3D. (b) For ease of drawing, we look at the hyperplane edge on.

$w$ is orthogonal to any vector on the hyperplane.

Figure 12.3 Possible separating hyperplanes. There are many linear classifiers (green lines) that separate orange crosses from blue discs.











12.2 Primal Support Vector Machine 


Based on the concept of distances from points to a hyperplane, we are now
in a position to discuss the support vector machine. For a dataset
$\{(x_1, y_1), \ldots, (x_N, y_N)\}$ that is linearly separable, we have
infinitely many candidate hyperplanes (refer to Figure 12.3), and therefore
classifiers, that solve our classification problem without any (training)
errors. To find a unique solution, one idea is to choose the separating
hyperplane that maximizes the margin between the positive and negative
examples. In other words, we want the positive and negative examples to be
separated by a large margin (Section 12.2.1). In the following, we compute
the distance between an example and a hyperplane to derive the margin.
Recall that the closest point on the hyperplane to a given point (example
$x_n$) is obtained by the orthogonal projection (Section 3.8).

A classifier with large margin turns out to generalize well (Steinwart and
Christmann, 2008).


12.2.1 Concept of the Margin 


The concept of the margin is intuitively simple: It is the distance of the
separating hyperplane to the closest examples in the dataset, assuming
that the dataset is linearly separable. However, when trying to formalize
this distance, there is a technical wrinkle that may be confusing. The
technical wrinkle is that we need to define a scale at which to measure the
distance. A potential scale is to consider the scale of the data, i.e., the
raw values of $x_n$. There are problems with this, as we could change the
units of measurement of $x_n$ and change the values in $x_n$, and, hence,
change the distance to the hyperplane. As we will see shortly, we define
the scale based on the equation of the hyperplane (12.3) itself.

There could be two or more closest examples to a hyperplane.

Consider a hyperplane $\langle w, x \rangle + b$, and an example $x_a$ as
illustrated in Figure 12.4. Without loss of generality, we can consider the
example $x_a$ to be on the positive side of the hyperplane, i.e.,
$\langle w, x_a \rangle + b > 0$. We would like to compute the distance
$r > 0$ of $x_a$ from the hyperplane. We do so by considering the
orthogonal projection (Section 3.8) of $x_a$ onto the hyperplane, which we
denote by $x_a'$. Since $w$ is orthogonal to the hyperplane, we know that
the distance $r$ is just a scaling of this vector $w$. If the length of $w$
is known, then we can use this scaling factor $r$ to work out the absolute
distance between $x_a$ and $x_a'$. For convenience, we choose to use a
vector of unit length (its norm is 1) and obtain this by dividing $w$ by
its norm, $\frac{w}{\|w\|}$. Using vector addition (Section 2.4), we obtain

$$x_a = x_a' + r \frac{w}{\|w\|} . \tag{12.8}$$

Figure 12.4 Vector addition to express distance to hyperplane: $x_a = x_a' + r \frac{w}{\|w\|}$.


Another way of thinking about $r$ is that it is the coordinate of $x_a$ in
the subspace spanned by $w / \|w\|$. We have now expressed the distance of
$x_a$ from the hyperplane as $r$, and if we choose $x_a$ to be the point
closest to the hyperplane, this distance $r$ is the margin.
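
As a worked illustration of (12.8), the sketch below computes the distance $r$ and the projection $x_a'$ for a hypothetical hyperplane and point. It uses the closed form $r = (\langle w, x_a \rangle + b) / \|w\|$, which follows from substituting (12.8) into the requirement that $x_a'$ lies on the hyperplane. The function name and the numbers are assumptions made for this example.

```python
import math


def inner(u, v):
    """Standard dot product, used here as the inner product <u, v>."""
    return sum(ui * vi for ui, vi in zip(u, v))


def distance_and_projection(x, w, b):
    """Return the signed distance r of x from the hyperplane <w, x> + b = 0
    and the orthogonal projection x' = x - r * w / ||w|| (rearranged (12.8))."""
    norm_w = math.sqrt(inner(w, w))
    r = (inner(w, x) + b) / norm_w
    x_proj = [xi - r * wi / norm_w for xi, wi in zip(x, w)]
    return r, x_proj


# Hypothetical hyperplane and example (assumptions for illustration).
w, b = [3.0, 4.0], -5.0
x_a = [3.0, 4.0]

r, x_proj = distance_and_projection(x_a, w, b)
residual = inner(w, x_proj) + b  # should vanish: x' lies on the hyperplane
```

Here $\|w\| = 5$ and $\langle w, x_a \rangle + b = 20$, so $r = 4$, and the projected point satisfies the hyperplane equation up to rounding.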

Recall that we would like the positive examples to be further than $r$
from the hyperplane, and the negative examples to be further than distance
$r$ (in the negative direction) from the hyperplane. Analogously to the
combination of (12.5) and (12.6) into (12.7), we formulate this objective
as

$$y_n(\langle w, x_n \rangle + b) \geq r . \tag{12.9}$$

In other words, we combine the requirements that examples are at least
$r$ away from the hyperplane (in the positive and negative direction) into
one single inequality.

Since we are interested only in the direction, we add an assumption to
our model that the parameter vector $w$ is of unit length, i.e.,
$\|w\| = 1$, where we use the Euclidean norm $\|w\| = \sqrt{w^\top w}$
(Section 3.1). This assumption also allows a more intuitive interpretation
of the distance $r$ (12.8) since it is the scaling factor of a vector of
length 1.

We will see other choices of inner products (Section 3.2) in Section 12.4.

Remark. A reader familiar with other presentations of the margin would
notice that our definition of $\|w\| = 1$ is different from the standard
presentation of the SVM, for example the one provided by Schölkopf and
Smola (2002). In Section 12.2.3, we will show the equivalence of both
approaches. ♢


Collecting the three requirements into a single constrained optimization
problem, we obtain the objective

$$\max_{w, b, r} \; \underbrace{r}_{\text{margin}} \quad \text{subject to} \quad \underbrace{y_n(\langle w, x_n \rangle + b) \geq r}_{\text{data fitting}}, \quad \underbrace{\|w\| = 1}_{\text{normalization}}, \quad r > 0 , \tag{12.10}$$

which says that we want to maximize the margin $r$ while ensuring that
the data lies on the correct side of the hyperplane.

Figure 12.5 Derivation of the margin: $r = \frac{1}{\|w\|}$.

Recall that we currently consider linearly separable data.


Remark. The concept of the margin turns out to be highly pervasive in
machine learning. It was used by Vladimir Vapnik and Alexey Chervonenkis
to show that when the margin is large, the "complexity" of the function
class is low, and hence learning is possible (Vapnik, 2000). It turns out
that the concept is useful for various different approaches for
theoretically analyzing generalization error (Steinwart and Christmann,
2008; Shalev-Shwartz and Ben-David, 2014). ♢


12.2.2 Traditional Derivation of the Margin 


In the previous section, we derived (12.10) by making the observation that
we are only interested in the direction of $w$ and not its length, leading
to the assumption that $\|w\| = 1$. In this section, we derive the margin
maximization problem by making a different assumption. Instead of choosing
that the parameter vector is normalized, we choose a scale for the data.
We choose this scale such that the value of the predictor
$\langle w, x \rangle + b$ is 1 at the closest example. Let us also denote
the example in the dataset that is closest to the hyperplane by $x_a$.

Figure 12.5 is identical to Figure 12.4, except that now we rescaled the
axes, such that the example $x_a$ lies exactly on the margin, i.e.,
$\langle w, x_a \rangle + b = 1$. Since $x_a'$ is the orthogonal projection
of $x_a$ onto the hyperplane, it must by definition lie on the hyperplane,
i.e.,

$$\langle w, x_a' \rangle + b = 0 . \tag{12.11}$$
By substituting (12.8) into (12.11), we obtain

$$\left\langle w, x_a - r \frac{w}{\|w\|} \right\rangle + b = 0 . \tag{12.12}$$

Exploiting the bilinearity of the inner product (see Section 3.2), we get

$$\langle w, x_a \rangle + b - r \frac{\langle w, w \rangle}{\|w\|} = 0 . \tag{12.13}$$

Observe that the first term is 1 by our assumption of scale, i.e.,
$\langle w, x_a \rangle + b = 1$. From (3.16) in Section 3.1, we know that
$\langle w, w \rangle = \|w\|^2$. Hence, the second term reduces to
$r \|w\|$. Using these simplifications, we obtain

$$r = \frac{1}{\|w\|} . \tag{12.14}$$
This means we derived the distance $r$ in terms of the normal vector $w$
of the hyperplane. At first glance, this equation is counterintuitive as we
seem to have derived the distance from the hyperplane in terms of the
length of the vector $w$, but we do not yet know this vector. One way to
think about it is to consider the distance $r$ to be a temporary variable
that we only use for this derivation. Therefore, for the rest of this
section we will denote the distance to the hyperplane by
$\frac{1}{\|w\|}$. In Section 12.2.3, we will see that the choice that the
margin equals 1 is equivalent to our previous assumption of $\|w\| = 1$ in
Section 12.2.1.

Similar to the argument to obtain (12.9), we want the positive and
negative examples to be at least 1 away from the hyperplane, which yields
the condition

$$y_n(\langle w, x_n \rangle + b) \geq 1 . \tag{12.15}$$


Combining the margin maximization with the fact that examples need to
be on the correct side of the hyperplane (based on their labels) gives us

$$\max_{w, b} \; \frac{1}{\|w\|} \tag{12.16}$$

$$\text{subject to} \quad y_n(\langle w, x_n \rangle + b) \geq 1 \quad \text{for all} \quad n = 1, \ldots, N . \tag{12.17}$$

Instead of maximizing the reciprocal of the norm as in (12.16), we often
minimize the squared norm. We also often include a constant $\frac{1}{2}$
that does not affect the optimal $w, b$ but yields a tidier form when we
compute the gradient. Then, our objective becomes

$$\min_{w, b} \; \frac{1}{2} \|w\|^2 \tag{12.18}$$

$$\text{subject to} \quad y_n(\langle w, x_n \rangle + b) \geq 1 \quad \text{for all} \quad n = 1, \ldots, N . \tag{12.19}$$


Equation (12.18) is known as the hard margin SVM. The reason for the
expression "hard" is because the formulation does not allow for any
violations of the margin condition. We will see in Section 12.2.4 that this
"hard" condition can be relaxed to accommodate violations if the data is
not linearly separable.

We can also think of the distance as the projection error that incurs when
projecting $x_a$ onto the hyperplane.

The squared norm results in a convex quadratic programming problem for the
SVM (Section 12.5).

Note that $r > 0$ because we assumed linear separability, and hence there
is no issue to divide by $r$.


12.2.3 Why We Can Set the Margin to 1 


In Section 12.2.1, we argued that we would like to maximize some value 
r, which represents the distance of the closest example to the hyperplane. 
In Section 12.2.2, we scaled the data such that the closest example is of 
distance 1 to the hyperplane. In this section, we relate the two derivations, 
and show that they are equivalent. 


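Theorem 12.1. Maximizing the margin $r$, where we consider normalized
weights as in (12.10),

$$\max_{w, b, r} \; \underbrace{r}_{\text{margin}} \quad \text{subject to} \quad \underbrace{y_n(\langle w, x_n \rangle + b) \geq r}_{\text{data fitting}}, \quad \underbrace{\|w\| = 1}_{\text{normalization}}, \quad r > 0 , \tag{12.20}$$

is equivalent to scaling the data, such that the margin is unity:

$$\min_{w, b} \; \underbrace{\frac{1}{2} \|w\|^2}_{\text{margin}} \quad \text{subject to} \quad \underbrace{y_n(\langle w, x_n \rangle + b) \geq 1}_{\text{data fitting}} . \tag{12.21}$$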

Proof. Consider (12.20). Since the square is a strictly monotonic
transformation for non-negative arguments, the maximum stays the same if we
consider $r^2$ in the objective. Since $\|w\| = 1$ we can reparametrize the
equation with a new weight vector $w'$ that is not normalized by explicitly
using $\frac{w'}{\|w'\|}$. We obtain

$$\max_{w', b, r} \; r^2 \quad \text{subject to} \quad y_n\left(\left\langle \frac{w'}{\|w'\|}, x_n \right\rangle + b\right) \geq r, \quad r > 0 . \tag{12.22}$$

Equation (12.22) explicitly states that the distance $r$ is positive.
Therefore, we can divide the first constraint by $r$, which yields

$$\max_{w', b, r} \; r^2 \quad \text{subject to} \quad y_n\Bigg(\Bigg\langle \underbrace{\frac{w'}{\|w'\| \, r}}_{w''}, x_n \Bigg\rangle + \underbrace{\frac{b}{r}}_{b''}\Bigg) \geq 1, \quad r > 0 , \tag{12.23}$$

renaming the parameters to $w''$ and $b''$. Since
$w'' = \frac{w'}{\|w'\| \, r}$, rearranging for $r$ gives

$$\|w''\| = \left\| \frac{w'}{\|w'\| \, r} \right\| = \frac{1}{r} \cdot \left\| \frac{w'}{\|w'\|} \right\| = \frac{1}{r} . \tag{12.24}$$

By substituting this result into (12.23), we obtain

$$\max_{w'', b''} \; \frac{1}{\|w''\|^2} \quad \text{subject to} \quad y_n(\langle w'', x_n \rangle + b'') \geq 1 . \tag{12.25}$$

The final step is to observe that maximizing $\frac{1}{\|w''\|^2}$ yields
the same solution as minimizing $\frac{1}{2} \|w''\|^2$, which concludes the
proof of Theorem 12.1.
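
The reparametrization used in the proof can be checked numerically. The sketch below takes a hypothetical separating hyperplane, computes the geometric margin (the smallest distance of an example from the hyperplane), rescales $(w, b)$ so that the closest example has predictor value 1, and confirms that the margin then equals $1 / \|w\|$. It illustrates the rescaling step only; it does not search for the optimal hyperplane. All names and numbers are assumptions for illustration.

```python
import math


def inner(u, v):
    """Standard dot product, used here as the inner product <u, v>."""
    return sum(ui * vi for ui, vi in zip(u, v))


def geometric_margin(X, y, w, b):
    """Smallest distance y_n * (<w, x_n> + b) / ||w|| over the dataset."""
    norm_w = math.sqrt(inner(w, w))
    return min(yn * (inner(w, xn) + b) / norm_w for xn, yn in zip(X, y))


def rescale_to_unit_margin(X, y, w, b):
    """Rescale (w, b) so that min_n y_n * (<w, x_n> + b) equals 1."""
    s = min(yn * (inner(w, xn) + b) for xn, yn in zip(X, y))
    return [wi / s for wi in w], b / s


# Hypothetical separable data and a separating hyperplane x_1 = 1.
X = [[2.0, 1.0], [3.0, 0.0], [0.0, 0.0], [-1.0, 1.0]]
y = [1, 1, -1, -1]
w, b = [4.0, 0.0], -4.0

r = geometric_margin(X, y, w, b)
w2, b2 = rescale_to_unit_margin(X, y, w, b)
r_from_norm = 1.0 / math.sqrt(inner(w2, w2))
```

Rescaling changes neither the hyperplane nor the geometric margin, which is why `r` and `r_from_norm` agree.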





12.2.4 Soft Margin SVM: Geometric View 


In the case where data is not linearly separable, we may wish to allow
some examples to fall within the margin region, or even to be on the
wrong side of the hyperplane as illustrated in Figure 12.6.

The model that allows for some classification errors is called the soft
margin SVM. In this section, we derive the resulting optimization problem
using geometric arguments. In Section 12.2.5, we will derive an equivalent
optimization problem using the idea of a loss function. Using Lagrange
multipliers (Section 7.2), we will derive the dual optimization problem of
the SVM in Section 12.3. This dual optimization problem allows us to
observe a third interpretation of the SVM: as a hyperplane that bisects
the line between convex hulls corresponding to the positive and negative
data examples (Section 12.3.2).

The key geometric idea is to introduce a slack variable $\xi_n$
corresponding to each example-label pair $(x_n, y_n)$ that allows a
particular example to be within the margin or even on the wrong side of
the hyperplane (refer to Figure 12.7). We subtract the value of $\xi_n$
from the margin, constraining $\xi_n$ to be non-negative. To encourage
correct classification of the samples, we add $\xi_n$ to the objective

$$\min_{w, b, \xi} \; \frac{1}{2} \|w\|^2 + C \sum_{n=1}^{N} \xi_n \tag{12.26a}$$

$$\text{subject to} \quad y_n(\langle w, x_n \rangle + b) \geq 1 - \xi_n \tag{12.26b}$$

$$\xi_n \geq 0 \tag{12.26c}$$

Figure 12.6 (a) Linearly separable and (b) non-linearly separable data.

Figure 12.7 Soft margin SVM allows examples to be within the margin or on the wrong side of the hyperplane. The slack variable $\xi$ measures the distance of a positive example $x_+$ to the positive margin hyperplane $\langle w, x \rangle + b = 1$ when $x_+$ is on the wrong side.

There are alternative parametrizations of this regularization, which is
why (12.26a) is also often referred to as the C-SVM.

for $n = 1, \ldots, N$. In contrast to the optimization problem (12.18) for
the hard margin SVM, this one is called the soft margin SVM. The parameter
$C > 0$ trades off the size of the margin and the total amount of slack
that we have. This parameter is called the regularization parameter since,
as we will see in the following section, the margin term in the objective
function (12.26a) is a regularization term. The margin term $\|w\|^2$ is
called the regularizer, and in many books on numerical optimization, the
regularization parameter is multiplied with this term (Section 8.2.3). This
is in contrast to our formulation in this section. Here a large value of
$C$ implies low regularization, as we give the slack variables larger
weight, hence giving more priority to examples that do not lie on the
correct side of the margin.
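
To make the role of the slack variables and of $C$ concrete, the sketch below evaluates (12.26a) for fixed, hypothetical parameters on a small dataset that is not linearly separable. For fixed $w$ and $b$, the smallest slack satisfying (12.26b) and (12.26c) is $\xi_n = \max\{0, 1 - y_n(\langle w, x_n \rangle + b)\}$. The data, the parameters, and the function names are assumptions for illustration.

```python
def inner(u, v):
    """Standard dot product, used here as the inner product <u, v>."""
    return sum(ui * vi for ui, vi in zip(u, v))


def slacks(X, y, w, b):
    """Smallest xi_n satisfying (12.26b) and (12.26c) for fixed w, b."""
    return [max(0.0, 1.0 - yn * (inner(w, xn) + b)) for xn, yn in zip(X, y)]


def soft_margin_objective(X, y, w, b, C):
    """Value of (12.26a) with the smallest feasible slack variables."""
    return 0.5 * inner(w, w) + C * sum(slacks(X, y, w, b))


# Hypothetical data: the second and fourth examples lie on the wrong side.
X = [[2.0, 0.0], [0.5, 0.0], [0.0, 0.0], [1.5, 0.0]]
y = [1, 1, -1, -1]
w, b = [1.0, 0.0], -1.0

xi = slacks(X, y, w, b)
objective_small_C = soft_margin_objective(X, y, w, b, C=1.0)
objective_large_C = soft_margin_objective(X, y, w, b, C=10.0)
```

The two examples on the correct side of the margin have zero slack, while the two misclassified ones each have slack 1.5. Increasing $C$ from 1 to 10 raises the cost of that same slack from 3 to 30, which is the sense in which a large $C$ gives margin violations more weight.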


Remark. In the formulation of the soft margin SVM (12.26a) $w$ is
regularized, but $b$ is not regularized. We can see this by observing that
the regularization term does not contain $b$. The unregularized term $b$
complicates theoretical analysis (Steinwart and Christmann, 2008,
chapter 1) and decreases computational efficiency (Fan et al., 2008). ♢


12.2.5 Soft Margin SVM: Loss Function View 


Let us consider a different approach for deriving the SVM, following the
principle of empirical risk minimization (Section 8.2). For the SVM, we
choose hyperplanes as the hypothesis class, that is

$$f(x) = \langle w, x \rangle + b . \tag{12.27}$$

We will see in this section that the margin corresponds to the
regularization term. The remaining question is, what is the loss function?
In contrast to Chapter 9, where we consider regression problems (the
output of the predictor is a real number), in this chapter, we consider
binary classification problems (the output of the predictor is one of two
labels $\{+1, -1\}$). Therefore, the error/loss function for each single
example-label pair needs to be appropriate for binary classification. For
example, the squared loss that is used for regression (9.10b) is not
suitable for binary classification.


Remark. The ideal loss function between binary labels is to count the
number of mismatches between the prediction and the label. This means that
for a predictor $f$ applied to an example $x_n$, we compare the output
$f(x_n)$ with the label $y_n$. We define the loss to be zero if they match,
and one if they do not match. This is denoted by
$\mathbf{1}(f(x_n) \neq y_n)$ and is called the zero-one loss.
Unfortunately, the zero-one loss results in a combinatorial optimization
problem for finding the best parameters $w, b$. Combinatorial optimization
problems (in contrast to continuous optimization problems discussed in
Chapter 7) are in general more challenging to solve. ♢


What is the loss function corresponding to the SVM? Consider the error
between the output of a predictor $f(x_n)$ and the label $y_n$. The loss
describes the error that is made on the training data. An equivalent way to
derive (12.26a) is to use the hinge loss

$$\ell(t) = \max\{0, 1 - t\} \quad \text{where} \quad t = y f(x) = y(\langle w, x \rangle + b) . \tag{12.28}$$


If $f(x)$ is on the correct side (based on the corresponding label $y$) of
the hyperplane, and further than distance 1, this means that $t \geq 1$ and
the hinge loss returns a value of zero. If $f(x)$ is on the correct side
but too close to the hyperplane ($0 < t < 1$), the example $x$ is within
the margin, and the hinge loss returns a positive value. When the example
is on the wrong side of the hyperplane ($t < 0$), the hinge loss returns an
even larger value, which increases linearly. In other words, we pay a
penalty once we are closer than the margin to the hyperplane, even if the
prediction is correct, and the penalty increases linearly. An alternative
way to express the hinge loss is by considering it as two linear pieces

$$\ell(t) = \begin{cases} 0 & \text{if } t \geq 1 \\ 1 - t & \text{if } t < 1 \end{cases} , \tag{12.29}$$

as illustrated in Figure 12.8. The loss corresponding to the hard margin
SVM (12.18) is defined as

$$\ell(t) = \begin{cases} 0 & \text{if } t \geq 1 \\ \infty & \text{if } t < 1 \end{cases} . \tag{12.30}$$
Figure 12.8 The hinge loss is a convex upper bound of zero-one loss. (Plot of the zero-one loss and the hinge loss $\max\{0, 1 - t\}$ as functions of $t$.)


This loss can be interpreted as never allowing any examples inside the 
margin. 
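
The three losses discussed here can be compared side by side in a short sketch. The function names are chosen for this example, and the zero-one loss is written as a function of $t = y f(x)$ under the assumption that $t \leq 0$ counts as a mismatch (the boundary case $t = 0$ is a convention, not something fixed by the text).

```python
import math


def hinge_loss(t):
    """Hinge loss (12.28): max{0, 1 - t}."""
    return max(0.0, 1.0 - t)


def zero_one_loss(t):
    """Zero-one loss as a function of t = y * f(x).
    Assumption: t <= 0 is treated as a mismatch."""
    return 1.0 if t <= 0 else 0.0


def hard_margin_loss(t):
    """Loss (12.30) of the hard margin SVM: infinite inside the margin."""
    return 0.0 if t >= 1 else math.inf


ts = [-2.0, -1.0, -0.5, 0.0, 0.5, 1.0, 2.0]
hinge_values = [hinge_loss(t) for t in ts]
is_upper_bound = all(hinge_loss(t) >= zero_one_loss(t) for t in ts)
```

On this grid the hinge loss is never below the zero-one loss, matching the caption of Figure 12.8, and the hard margin loss is infinite for every $t < 1$.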

For a given training set $\{(x_1, y_1), \ldots, (x_N, y_N)\}$, we seek to
minimize the total loss, while regularizing the objective with
$\ell_2$-regularization (see Section 8.2.3). Using the hinge loss (12.28)
gives us the unconstrained optimization problem

$$\min_{w, b} \; \underbrace{\frac{1}{2} \|w\|^2}_{\text{regularizer}} + \underbrace{C \sum_{n=1}^{N} \max\{0, 1 - y_n(\langle w, x_n \rangle + b)\}}_{\text{error term}} . \tag{12.31}$$

The first term in (12.31) is called the regularization term or the
regularizer (see Section 8.2.3), and the second term is called the loss
term or the error term. Recall from Section 12.2.4 that the term
$\frac{1}{2} \|w\|^2$ arises directly from the margin. In other words,
margin maximization can be interpreted as regularization.

In principle, the unconstrained optimization problem in (12.31) can
be directly solved with (sub-)gradient descent methods as described in
Section 7.1. To see that (12.31) and (12.26a) are equivalent, observe that
the hinge loss (12.28) essentially consists of two linear parts, as
expressed in (12.29). Consider the hinge loss for a single example-label
pair (12.28). We can equivalently replace minimization of the hinge loss
over $t$ with a minimization of a slack variable $\xi$ with two
constraints. In equation form,

$$\min_{t} \; \max\{0, 1 - t\} \tag{12.32}$$

is equivalent to

$$\min_{\xi, t} \; \xi \quad \text{subject to} \quad \xi \geq 0, \quad \xi \geq 1 - t . \tag{12.33}$$

By substituting this expression into (12.31) and rearranging one of the
constraints, we obtain exactly the soft margin SVM (12.26a).
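
A minimal sketch of subgradient descent applied to (12.31) is given below. The constant step size, the number of steps, the zero initialization, and the toy data are all assumptions made for illustration; this is not a tuned or recommended solver. The subgradient of a hinge term is $-y_n x_n$ (with respect to $w$) and $-y_n$ (with respect to $b$) when $y_n(\langle w, x_n \rangle + b) < 1$, and zero otherwise.

```python
def inner(u, v):
    """Standard dot product, used here as the inner product <u, v>."""
    return sum(ui * vi for ui, vi in zip(u, v))


def objective(X, y, w, b, C):
    """Value of (12.31): 0.5 * ||w||^2 + C * sum of hinge losses."""
    hinge = sum(max(0.0, 1.0 - yn * (inner(w, xn) + b)) for xn, yn in zip(X, y))
    return 0.5 * inner(w, w) + C * hinge


def subgradient_step(X, y, w, b, C, lr):
    """One subgradient descent step on (12.31)."""
    grad_w = list(w)  # gradient of the regularizer 0.5 * ||w||^2
    grad_b = 0.0
    for xn, yn in zip(X, y):
        if yn * (inner(w, xn) + b) < 1.0:  # hinge term is active
            grad_w = [g - C * yn * xk for g, xk in zip(grad_w, xn)]
            grad_b -= C * yn
    w_new = [wk - lr * g for wk, g in zip(w, grad_w)]
    return w_new, b - lr * grad_b


def train(X, y, C=1.0, lr=0.005, steps=5000):
    """Run a fixed number of subgradient steps starting from zero."""
    w, b = [0.0] * len(X[0]), 0.0
    for _ in range(steps):
        w, b = subgradient_step(X, y, w, b, C, lr)
    return w, b


# Hypothetical linearly separable toy data.
X = [[2.0, 1.0], [3.0, 0.0], [0.0, 0.0], [-1.0, 1.0]]
y = [1, 1, -1, -1]
w, b = train(X, y)
```

With a constant step size the iterates only settle into a small neighborhood of a minimizer rather than converging exactly, so in practice a decreasing step size or a dedicated solver would typically be preferred.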


Remark. Let us contrast our choice of the loss function in this section to
the loss function for linear regression in Chapter 9. Recall from
Section 9.2.1 that for finding maximum likelihood estimators, we usually
minimize the negative log-likelihood. Furthermore, since the likelihood
term for linear regression with Gaussian noise is Gaussian, the negative
log-likelihood for each example is a squared error function. The squared
error function is the loss function that is minimized when looking for the
maximum likelihood solution. ♢


12.3 Dual Support Vector Machine 


The description of the SVM in the previous sections, in terms of the
variables $w$ and $b$, is known as the primal SVM. Recall that we consider
inputs $x \in \mathbb{R}^D$ with $D$ features. Since $w$ is of the same
dimension as $x$, this means that the number of parameters (the dimension
of $w$) of the optimization problem grows linearly with the number of
features.

In the following, we consider an equivalent optimization problem (the
so-called dual view), which is independent of the number of features.
Instead, the number of parameters increases with the number of examples
in the training set. We saw a similar idea appear in Chapter 10, where we
expressed the learning problem in a way that does not scale with the
number of features. This is useful for problems where we have more
features than the number of examples in the training dataset. The dual SVM
also has the additional advantage that it easily allows kernels to be
applied, as we shall see at the end of this chapter. The word "dual"
appears often in mathematical literature, and in this particular case it
refers to convex duality. The following subsections are essentially an
application of convex duality, which we discussed in Section 7.2.


12.3.1 Convex Duality via Lagrange Multipliers 


Recall the primal soft margin SVM (12.26a). We call the variables $w$, $b$,
and $\xi$ corresponding to the primal SVM the primal variables. We use
$\alpha_n \geq 0$ as the Lagrange multiplier corresponding to the
constraint (12.26b) that the examples are classified correctly and
$\gamma_n \geq 0$ as the Lagrange multiplier corresponding to the
non-negativity constraint of the slack variable; see (12.26c). The
Lagrangian is then given by

$$\mathfrak{L}(w, b, \xi, \alpha, \gamma) = \frac{1}{2} \|w\|^2 + C \sum_{n=1}^{N} \xi_n - \underbrace{\sum_{n=1}^{N} \alpha_n \big( y_n(\langle w, x_n \rangle + b) - 1 + \xi_n \big)}_{\text{constraint (12.26b)}} - \underbrace{\sum_{n=1}^{N} \gamma_n \xi_n}_{\text{constraint (12.26c)}} . \tag{12.34}$$
In Chapter 7, we used $\lambda$ as Lagrange multipliers. In this section,
we use the notation of the SVM literature, $\alpha$ and $\gamma$.

The representer theorem is actually a collection of theorems saying that
the solution of minimizing empirical risk lies in the subspace
(Section 2.4.3) defined by the examples.

By differentiating the Lagrangian (12.34) with respect to the three primal
variables $w$, $b$, and $\xi$ respectively, we obtain

$$\frac{\partial \mathfrak{L}}{\partial w} = w^\top - \sum_{n=1}^{N} \alpha_n y_n x_n^\top , \tag{12.35}$$

$$\frac{\partial \mathfrak{L}}{\partial b} = - \sum_{n=1}^{N} \alpha_n y_n , \tag{12.36}$$

$$\frac{\partial \mathfrak{L}}{\partial \xi_n} = C - \alpha_n - \gamma_n . \tag{12.37}$$

We now find the minimum of the Lagrangian with respect to the primal
variables by setting each of these partial derivatives to zero. By setting
(12.35) to zero, we find

$$w = \sum_{n=1}^{N} \alpha_n y_n x_n , \tag{12.38}$$

which is a particular instance of the representer theorem (Kimeldorf and
Wahba, 1970). Equation (12.38) states that the optimal weight vector in
the primal is a linear combination of the examples $x_n$. Recall from
Section 2.6.1 that this means that the solution of the optimization problem
lies in the span of training data. Additionally, the constraint obtained by
setting (12.36) to zero implies that the optimal weight vector is an affine
combination of the examples. The representer theorem turns out to hold
for very general settings of regularized empirical risk minimization
(Hofmann et al., 2008; Argyriou and Dinuzzo, 2014). The theorem has more
general versions (Schölkopf et al., 2001), and necessary and sufficient
conditions on its existence can be found in Yu et al. (2013).


Remark. The representer theorem (12.38) also provides an explanation
of the name "support vector machine." The examples $x_n$, for which the
corresponding parameters $\alpha_n = 0$, do not contribute to the solution
$w$ at all. The other examples, where $\alpha_n > 0$, are called support
vectors since they "support" the hyperplane. ♢


By substituting the expression for $w$ into the Lagrangian (12.34), we
obtain the dual

$$\mathfrak{D}(\xi, \alpha, \gamma) = \frac{1}{2} \sum_{i=1}^{N} \sum_{j=1}^{N} y_i y_j \alpha_i \alpha_j \langle x_i, x_j \rangle - \sum_{i=1}^{N} y_i \alpha_i \Big\langle \sum_{j=1}^{N} y_j \alpha_j x_j, x_i \Big\rangle + C \sum_{i=1}^{N} \xi_i - b \sum_{i=1}^{N} y_i \alpha_i + \sum_{i=1}^{N} \alpha_i - \sum_{i=1}^{N} \alpha_i \xi_i - \sum_{i=1}^{N} \gamma_i \xi_i . \tag{12.39}$$

Note that there are no longer any terms involving the primal variable $w$.
By setting (12.36) to zero, we obtain $\sum_{n=1}^{N} y_n \alpha_n = 0$.
Therefore, the term involving $b$ also vanishes. Recall that inner products
are symmetric and bilinear (see Section 3.2). Therefore, the first two
terms in (12.39) are over the same objects. These terms can be simplified,
and we obtain the Lagrangian

$$\mathfrak{D}(\xi, \alpha, \gamma) = - \frac{1}{2} \sum_{i=1}^{N} \sum_{j=1}^{N} y_i y_j \alpha_i \alpha_j \langle x_i, x_j \rangle + \sum_{i=1}^{N} \alpha_i + \sum_{i=1}^{N} (C - \alpha_i - \gamma_i) \xi_i . \tag{12.40}$$
The last term in this equation is a collection of all terms that contain
slack variables $\xi_i$. By setting (12.37) to zero, we see that the last
term in (12.40) is also zero. Furthermore, by using the same equation and
recalling that the Lagrange multipliers $\gamma_i$ are non-negative, we
conclude that $\alpha_i \leq C$. We now obtain the dual optimization
problem of the SVM, which is expressed exclusively in terms of the Lagrange
multipliers $\alpha_i$. Recall from Lagrangian duality (Definition 7.1)
that we maximize the dual problem. This is equivalent to minimizing the
negative dual problem, such that we end up with the dual SVM

$$\begin{aligned} \min_{\alpha} \quad & \frac{1}{2} \sum_{i=1}^{N} \sum_{j=1}^{N} y_i y_j \alpha_i \alpha_j \langle x_i, x_j \rangle - \sum_{i=1}^{N} \alpha_i \\ \text{subject to} \quad & \sum_{i=1}^{N} y_i \alpha_i = 0 \\ & 0 \leq \alpha_i \leq C \quad \text{for all} \quad i = 1, \ldots, N . \end{aligned} \tag{12.41}$$


The equality constraint in (12.41) is obtained from setting (12.36) to
zero. The inequality constraint $\alpha_i \geq 0$ is the condition imposed
on Lagrange multipliers of inequality constraints (Section 7.2). The
inequality constraint $\alpha_i \leq C$ is discussed in the previous
paragraph.
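
The dual objective and its constraints can be evaluated directly for a tiny example in which the minimizer can be worked out by hand. In the hypothetical two-point dataset below, the equality constraint forces $\alpha_1 = \alpha_2 = a$, the objective in (12.41) becomes $2a^2 - 2a$, and its minimum is at $a = 1/2$. The sketch checks feasibility and compares the dual value with the primal objective at $w = [1, 0]^\top$, $b = -1$. Function names and data are assumptions for illustration.

```python
def inner(u, v):
    """Standard dot product, used here as the inner product <u, v>."""
    return sum(ui * vi for ui, vi in zip(u, v))


def dual_objective(alpha, X, y):
    """Objective of (12.41): 0.5 * sum_ij y_i y_j a_i a_j <x_i, x_j> - sum_i a_i."""
    n = len(X)
    quad = sum(
        y[i] * y[j] * alpha[i] * alpha[j] * inner(X[i], X[j])
        for i in range(n)
        for j in range(n)
    )
    return 0.5 * quad - sum(alpha)


def is_feasible(alpha, y, C, tol=1e-12):
    """Check sum_i y_i * alpha_i = 0 and the box constraints 0 <= alpha_i <= C."""
    balanced = abs(sum(yi * ai for yi, ai in zip(y, alpha))) <= tol
    in_box = all(-tol <= ai <= C + tol for ai in alpha)
    return balanced and in_box


# Hypothetical two-point dataset with a hand-derived dual solution.
X = [[2.0, 0.0], [0.0, 0.0]]
y = [1, -1]
C = 1.0
alpha = [0.5, 0.5]

dual_value = dual_objective(alpha, X, y)

# Primal objective (12.26a) at w = [1, 0], b = -1, where all slacks are zero.
w, b = [1.0, 0.0], -1.0
primal_value = 0.5 * inner(w, w)
```

Because (12.41) minimizes the negative of the dual, the value `-dual_value` is the dual optimum, and it coincides with the primal value in this example.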

The set of inequality constraints in the SVM are called "box constraints"
because they limit the vector
$\alpha = [\alpha_1, \ldots, \alpha_N]^\top \in \mathbb{R}^N$ of Lagrange
multipliers to be inside the box defined by 0 and $C$ on each axis. These
axis-aligned boxes are particularly efficient to implement in numerical
solvers (Dostál, 2009, chapter 5).

Once we obtain the dual parameters $\alpha$, we can recover the primal
parameters $w$ by using the representer theorem (12.38). Let us call the
optimal primal parameter $w^*$. However, there remains the question on how
to obtain the parameter $b^*$. Consider an example $x_n$ that lies exactly
on the margin's boundary, i.e., $\langle w^*, x_n \rangle + b = y_n$. Recall
that $y_n$ is either $+1$ or $-1$. Therefore, the only unknown is $b$,
which can be computed by

$$b^* = y_n - \langle w^*, x_n \rangle . \tag{12.42}$$
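
Recovering the primal parameters from a dual solution can be sketched as follows. The dual parameters below are a hand-derived solution for a hypothetical two-point dataset, and the helper names are assumptions for this example. Examples with $0 < \alpha_n < C$ are used as the on-margin examples in (12.42).

```python
def inner(u, v):
    """Standard dot product, used here as the inner product <u, v>."""
    return sum(ui * vi for ui, vi in zip(u, v))


def recover_w(alpha, X, y):
    """Representer theorem (12.38): w = sum_n alpha_n * y_n * x_n."""
    dim = len(X[0])
    return [sum(a * yn * xn[k] for a, yn, xn in zip(alpha, y, X)) for k in range(dim)]


def recover_b_candidates(alpha, X, y, w, C, tol=1e-9):
    """Apply (12.42), b = y_n - <w, x_n>, to every example with 0 < alpha_n < C."""
    return [
        yn - inner(w, xn)
        for a, xn, yn in zip(alpha, X, y)
        if tol < a < C - tol
    ]


# Hypothetical dataset with dual solution alpha = [0.5, 0.5] for C = 1.
X = [[2.0, 0.0], [0.0, 0.0]]
y = [1, -1]
alpha, C = [0.5, 0.5], 1.0

w_star = recover_w(alpha, X, y)
b_candidates = recover_b_candidates(alpha, X, y, w_star, C)
b_star = b_candidates[0]
```

In this example every on-margin example yields the same value of $b^*$. With a numerically computed $\alpha$ the candidates may differ slightly, which is one reason to combine them rather than rely on a single example.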
Remark. In principle, there may be no examples that lie exactly on the
margin. In this case, we should compute
$|y_n - \langle w^*, x_n \rangle|$ for all support vectors and take the
median value of this absolute value difference to be the value of $b^*$. A
derivation of this can be found in
http://fouryears.eu/2012/06/07/the-svm-bias-term-conspiracy/. ♢

It turns out that examples that lie exactly on the margin are examples
whose dual parameters lie strictly inside the box constraints,
$0 < \alpha_i < C$. This is derived using the Karush Kuhn Tucker
conditions, for example in Schölkopf and Smola (2002).

Figure 12.9 Convex hulls. (a) Convex hull of points, some of which lie within the boundary; (b) convex hulls around positive (blue) and negative (orange) examples. The distance between the two convex sets is the length of the difference vector $c - d$.


12.3.2 Dual SVM: Convex Hull View 


Another approach to obtain the dual SVM is to consider an alternative
geometric argument. Consider the set of examples $x_n$ with the same label.
We would like to build a convex set that contains all the examples such
that it is the smallest possible set. This is called the convex hull and is
illustrated in Figure 12.9.

Let us first build some intuition about a convex combination of points.
Consider two points $x_1$ and $x_2$ and corresponding non-negative weights
$\alpha_1, \alpha_2 \geq 0$ such that $\alpha_1 + \alpha_2 = 1$. The
equation $\alpha_1 x_1 + \alpha_2 x_2$ describes each point on a line
between $x_1$ and $x_2$. Consider what happens when we add a third point
$x_3$ along with a weight $\alpha_3 \geq 0$ such that
$\sum_{n=1}^{3} \alpha_n = 1$. The convex combination of these three points
$x_1, x_2, x_3$ spans a two-dimensional area. The convex hull of this area
is the triangle formed by the edges corresponding to each pair of points.
As we add more points, and the number of points becomes greater than the
number of dimensions, some of the points will be inside the convex hull, as
we can see in Figure 12.9(a).
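
Convex combinations are straightforward to compute. The sketch below forms a point on the segment between two points and a point inside the triangle spanned by three points, and rejects weights that are negative or do not sum to one. The points, the weights, and the function name are assumptions for illustration.

```python
def convex_combination(points, weights, tol=1e-12):
    """Return sum_n weights[n] * points[n] for non-negative weights summing to one."""
    if any(a < 0 for a in weights) or abs(sum(weights) - 1.0) > tol:
        raise ValueError("weights must be non-negative and sum to one")
    dim = len(points[0])
    return [sum(a * p[k] for a, p in zip(weights, points)) for k in range(dim)]


# Hypothetical points spanning a triangle in the plane.
x1, x2, x3 = [0.0, 0.0], [4.0, 0.0], [0.0, 4.0]

midpoint = convex_combination([x1, x2], [0.5, 0.5])              # on the segment
interior = convex_combination([x1, x2, x3], [0.25, 0.25, 0.5])   # inside the triangle
```

Setting one weight to 1 and the others to 0 returns the corresponding corner, which is why the original points always belong to their own convex hull.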

In general, building a convex hull can be done by introducing
non-negative weights $\alpha_n \geq 0$ corresponding to each example
$x_n$. Then the convex hull can be described as the set

$$\operatorname{conv}(X) = \left\{ \sum_{n=1}^{N} \alpha_n x_n \right\} \quad \text{with} \quad \sum_{n=1}^{N} \alpha_n = 1 \quad \text{and} \quad \alpha_n \geq 0 , \tag{12.43}$$

for all $n = 1, \ldots, N$. If the two clouds of points corresponding to the
positive and negative classes are separated, then the convex hulls do not
overlap. Given the training data $(x_1, y_1), \ldots, (x_N, y_N)$, we form
two convex hulls, corresponding to the positive and negative classes
respectively. We pick a point $c$, which is in the convex hull of the set
of positive examples, and is closest to the negative class distribution.
Similarly, we pick a point $d$ in the convex hull of the set of negative
examples that is closest to the positive class distribution; see
Figure 12.9(b). We define a difference vector between $d$ and $c$ as

$$w := c - d . \tag{12.44}$$


Picking the points $c$ and $d$ as in the preceding cases, and requiring
them to be closest to each other is equivalent to minimizing the
length/norm of $w$, so that we end up with the corresponding optimization
problem

$$\arg\min_{w} \|w\| = \arg\min_{w} \frac{1}{2} \|w\|^2 . \tag{12.45}$$

Since $c$ must be in the positive convex hull, it can be expressed as a
convex combination of the positive examples, i.e., for non-negative
coefficients $\alpha_n^+$

$$c = \sum_{n : y_n = +1} \alpha_n^+ x_n . \tag{12.46}$$

In (12.46), we use the notation $n : y_n = +1$ to indicate the set of
indices $n$ for which $y_n = +1$. Similarly, for the examples with negative
labels, we obtain

$$d = \sum_{n : y_n = -1} \alpha_n^- x_n . \tag{12.47}$$

By substituting (12.44), (12.46), and (12.47) into (12.45), we obtain the
objective

$$\min_{\alpha} \; \frac{1}{2} \left\| \sum_{n : y_n = +1} \alpha_n^+ x_n - \sum_{n : y_n = -1} \alpha_n^- x_n \right\|^2 . \tag{12.48}$$

Let $\alpha$ be the set of all coefficients, i.e., the concatenation of
$\alpha^+$ and $\alpha^-$. Recall that we require that for each convex hull
that their coefficients sum to one,

$$\sum_{n : y_n = +1} \alpha_n^+ = 1 \quad \text{and} \quad \sum_{n : y_n = -1} \alpha_n^- = 1 . \tag{12.49}$$

This implies the constraint

$$\sum_{n=1}^{N} y_n \alpha_n = 0 . \tag{12.50}$$
kernel function can be very general and are not necessarily restricted to $\mathbb{R}^D$.
This result can be seen by multiplying out the individual classes

$$\sum_{n=1}^{N} y_n \alpha_n = \sum_{n : y_n = +1} (+1) \alpha_n^+ + \sum_{n : y_n = -1} (-1) \alpha_n^- \tag{12.51a}$$

$$= \sum_{n : y_n = +1} \alpha_n^+ - \sum_{n : y_n = -1} \alpha_n^- = 1 - 1 = 0 . \tag{12.51b}$$


The objective function (12.48) and the constraint (12.50), along with the
assumption that $\alpha \geq 0$, give us a constrained (convex)
optimization problem. This optimization problem can be shown to be the same
as that of the dual hard margin SVM (Bennett and Bredensteiner, 2000a).


Remark. To obtain the soft margin dual, we consider the reduced hull. The
reduced hull is similar to the convex hull but has an upper bound to the
size of the coefficients $\alpha$. The maximum possible value of the
elements of $\alpha$ restricts the size that the convex hull can take. In
other words, the bound on $\alpha$ shrinks the convex hull to a smaller
volume (Bennett and Bredensteiner, 2000b). ♢


12.4 Kernels 


Consider the formulation of the dual SVM (12.41). Notice that the inner
product in the objective occurs only between examples $\boldsymbol{x}_i$ and $\boldsymbol{x}_j$.
There are no inner products between the examples and the parameters.
Therefore, if we consider a set of features $\phi(\boldsymbol{x}_i)$ to represent $\boldsymbol{x}_i$, the only
change in the dual SVM will be to replace the inner product. This modularity,
where the choice of the classification method (the SVM) and the
choice of the feature representation $\phi(\boldsymbol{x})$ can be considered separately,
provides flexibility for us to explore the two problems independently. In
this section, we discuss the representation $\phi(\boldsymbol{x})$ and briefly introduce the
idea of kernels, but do not go into the technical details.

Since $\phi(\boldsymbol{x})$ could be a non-linear function, we can use the SVM (which
assumes a linear classifier) to construct classifiers that are non-linear in
the examples $\boldsymbol{x}_n$. This provides a second avenue, in addition to the soft
margin, for users to deal with a dataset that is not linearly separable. It
turns out that there are many algorithms and statistical methods that have
this property that we observed in the dual SVM: the only inner products
are those that occur between examples. Instead of explicitly defining a
non-linear feature map $\phi(\cdot)$ and computing the resulting inner product
between examples $\boldsymbol{x}_i$ and $\boldsymbol{x}_j$, we define a similarity function
$k(\boldsymbol{x}_i, \boldsymbol{x}_j)$ between $\boldsymbol{x}_i$ and $\boldsymbol{x}_j$. For a certain class of similarity
functions, called kernels, the similarity function implicitly defines a
non-linear feature map $\phi(\cdot)$. Kernels are by definition functions
$k : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ for which there exists a Hilbert space $\mathcal{H}$ and
a feature map $\phi : \mathcal{X} \to \mathcal{H}$ such that


$$k(\boldsymbol{x}_i, \boldsymbol{x}_j) = \langle \phi(\boldsymbol{x}_i), \phi(\boldsymbol{x}_j) \rangle_{\mathcal{H}} . \tag{12.52}$$

[Figure 12.10, four panels plotting the second feature against the first feature:
(a) SVM with linear kernel, (b) SVM with RBF kernel,
(c) SVM with polynomial (degree 2) kernel, (d) SVM with polynomial (degree 3) kernel.]


There is a unique reproducing kernel Hilbert space associated with every
kernel $k$ (Aronszajn, 1950; Berlinet and Thomas-Agnan, 2004). In this
unique association, $\phi(\boldsymbol{x}) = k(\cdot, \boldsymbol{x})$ is called the canonical feature map.
The generalization from an inner product to a kernel function (12.52) is
known as the kernel trick (Schölkopf and Smola, 2002; Shawe-Taylor and
Cristianini, 2004), as it hides away the explicit non-linear feature map.
The matrix $\boldsymbol{K} \in \mathbb{R}^{N \times N}$, resulting from the inner products or the
application of $k(\cdot, \cdot)$ to a dataset, is called the Gram matrix, and is often just
referred to as the kernel matrix. Kernels must be symmetric and positive
semidefinite functions so that every kernel matrix $\boldsymbol{K}$ is symmetric and
positive semidefinite (Section 3.2.3):

$$\forall \boldsymbol{z} \in \mathbb{R}^N : \boldsymbol{z}^\top \boldsymbol{K} \boldsymbol{z} \geq 0 . \tag{12.53}$$
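Condition (12.53) can be spot-checked numerically. The sketch below is only an illustration: it assumes the Gaussian radial basis function kernel in the common parametrization $\exp(-\gamma \|\boldsymbol{x} - \boldsymbol{z}\|^2)$, with a hypothetical value for `gamma`, random example points, and helper names of our own choosing. It builds the Gram matrix and evaluates the quadratic form for many random vectors.

```python
import math
import random


def rbf_kernel(x, z, gamma=0.5):
    """Gaussian RBF kernel exp(-gamma * ||x - z||^2); gamma is an assumed value."""
    sq_dist = sum((xi - zi) ** 2 for xi, zi in zip(x, z))
    return math.exp(-gamma * sq_dist)


def gram_matrix(points, kernel):
    """Return the N x N matrix K with K[i][j] = kernel(points[i], points[j])."""
    return [[kernel(p, q) for q in points] for p in points]


def quadratic_form(K, z):
    """Compute z^T K z."""
    n = len(z)
    return sum(z[i] * K[i][j] * z[j] for i in range(n) for j in range(n))


random.seed(0)
points = [(random.uniform(-1, 1), random.uniform(-1, 1)) for _ in range(6)]
K = gram_matrix(points, rbf_kernel)

is_symmetric = all(
    abs(K[i][j] - K[j][i]) < 1e-12 for i in range(6) for j in range(6)
)
# Spot-check (12.53) on random vectors; this is evidence, not a proof.
min_form = min(
    quadratic_form(K, [random.gauss(0, 1) for _ in range(6)]) for _ in range(200)
)
```

A random check like this can only falsify positive semidefiniteness, never establish it. A full check would inspect the eigenvalues of $\boldsymbol{K}$.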


Some popular examples of kernels for multivariate real-valued data
$\boldsymbol{x}_i \in \mathbb{R}^D$ are the polynomial kernel, the Gaussian radial basis function kernel,
and the rational quadratic kernel (Schölkopf and Smola, 2002; Rasmussen
and Williams, 2006). Figure 12.10 illustrates the effect of different kernels
on separating hyperplanes on an example dataset. Note that we are still
solving for hyperplanes, that is, the hypothesis class of functions is still
linear. The non-linear surfaces are due to the kernel function.

Figure 12.10 SVM with different kernels. Note that while the decision
boundary is nonlinear, the underlying problem being solved is for a linear
separating hyperplane (albeit with a nonlinear kernel).

[Margin note: The choice of kernel, as well as the parameters of the kernel,
is often made using nested cross-validation (Section 8.6.1).]


Remark. Unfortunately for the fledgling machine learner, there are multiple
meanings of the word “kernel.” In this chapter, the word “kernel”
comes from the idea of the reproducing kernel Hilbert space (RKHS)
(Aronszajn, 1950; Saitoh, 1988). We have discussed the idea of the kernel in
linear algebra (Section 2.7.3), where the kernel is another word for the null
space. The third common use of the word “kernel” in machine learning is
the smoothing kernel in kernel density estimation (Section 11.5). ♢


Since the explicit representation $\phi(\boldsymbol{x})$ is mathematically equivalent to
the kernel representation $k(\boldsymbol{x}_i, \boldsymbol{x}_j)$, a practitioner will often design the
kernel function such that it can be computed more efficiently than the
inner product between explicit feature maps. For example, consider the
polynomial kernel (Schölkopf and Smola, 2002), where the number of
terms in the explicit expansion grows very quickly (even for polynomials
of low degree) when the input dimension is large. The kernel function
only requires one multiplication per input dimension, which can provide
significant computational savings. Another example is the Gaussian radial
basis function kernel (Schölkopf and Smola, 2002; Rasmussen and
Williams, 2006), where the corresponding feature space is infinite
dimensional. In this case, we cannot explicitly represent the feature space but
can still compute similarities between a pair of examples using the kernel.
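The computational saving for the polynomial kernel can be made concrete. As a small sketch, we assume the homogeneous degree-2 polynomial kernel $k(\boldsymbol{x}, \boldsymbol{z}) = (\boldsymbol{x}^\top \boldsymbol{z})^2$, whose explicit feature map lists all $D^2$ ordered products $x_i x_j$. The function names and the example vectors are our own choices for illustration.

```python
from itertools import product


def poly2_kernel(x, z):
    """Homogeneous degree-2 polynomial kernel: (x^T z)^2."""
    return sum(xi * zi for xi, zi in zip(x, z)) ** 2


def poly2_features(x):
    """Explicit feature map: all D*D ordered products x_i * x_j."""
    return [xi * xj for xi, xj in product(x, repeat=2)]


x = [1.0, 2.0, 3.0]
z = [0.5, -1.0, 2.0]

# Inner product in the explicit feature space: D^2 multiplications.
explicit = sum(a * b for a, b in zip(poly2_features(x), poly2_features(z)))
# Kernel evaluation: D multiplications followed by one squaring.
implicit = poly2_kernel(x, z)
num_features = len(poly2_features(x))
```

Both routes give the same number, but the explicit route needs a feature vector of length $D^2$ (here 9), whereas the kernel only touches the $D$ input coordinates. For degree $d$ the explicit expansion has $D^d$ ordered terms, which is the rapid growth described above.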

Another useful aspect of the kernel trick is that there is no need for
the original data to be already represented as multivariate real-valued
data. Note that the inner product is defined on the output of the function
$\phi(\cdot)$, but does not restrict the input to real numbers. Hence, the function
$\phi(\cdot)$ and the kernel function $k(\cdot, \cdot)$ can be defined on any object, e.g.,
sets, sequences, strings, graphs, and distributions (Ben-Hur et al., 2008;
Gärtner, 2008; Shi et al., 2009; Sriperumbudur et al., 2010; Vishwanathan
et al., 2010).
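To illustrate a kernel on non-vectorial objects, the following sketch defines a simple kernel on strings. It is our own example rather than one taken from the references above: the feature map counts every substring of length $k$ (a construction often called a spectrum kernel), and the kernel is the inner product of these count vectors. The function names and the example strings are hypothetical.

```python
from collections import Counter


def kmer_counts(s, k=2):
    """Feature map: counts of every length-k substring of s."""
    return Counter(s[i:i + k] for i in range(len(s) - k + 1))


def spectrum_kernel(s, t, k=2):
    """Inner product of the k-mer count vectors of two strings."""
    cs, ct = kmer_counts(s, k), kmer_counts(t, k)
    # Counter returns 0 for substrings that do not occur in t.
    return sum(cs[m] * ct[m] for m in cs)


value = spectrum_kernel("GATTACA", "ATTAC")
self_similarity = spectrum_kernel("GATTACA", "GATTACA")
```

Because the kernel is defined as an inner product of explicit feature vectors, it is symmetric and positive semidefinite by construction, so it could be plugged into the dual SVM in place of the inner product between real-valued examples.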


12.5 Numerical Solution 


We conclude our discussion of SVMs by looking at how to express the 
problems derived in this chapter in terms of the concepts presented in 
Chapter 7. We consider two different approaches for finding the optimal 
solution for the SVM. First we consider the loss view of the SVM (Section 8.2.2)
and express this as an unconstrained optimization problem. Then we express the
constrained versions of the primal and dual SVMs as quadratic programs
in standard form (Section 7.3.2).

Consider the loss function view of the SVM (12.31). This is a convex
unconstrained optimization problem, but the hinge loss (12.28) is not
differentiable. Therefore, we apply a subgradient approach for solving it.
However, the hinge loss is differentiable almost everywhere, except for
one single point at the hinge $t = 1$. At this point, the gradient is a set of
possible values that lie between $0$ and $-1$. Therefore, the subgradient $g$ of
the hinge loss is given by

$$g(t) = \begin{cases} -1 & t < 1 \\ [-1, 0] & t = 1 \\ 0 & t > 1 \end{cases} . \tag{12.54}$$


Using this subgradient, we can apply the optimization methods presented 
in Section 7.1. 
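As a rough illustration of how (12.54) enters an optimizer, here is a minimal subgradient descent sketch. It assumes that the objective (12.31) has the form $\frac{1}{2}\|\boldsymbol{w}\|^2 + C \sum_n \max\{0, 1 - y_n(\langle \boldsymbol{w}, \boldsymbol{x}_n\rangle + b)\}$ with the dot product as inner product. The step size, number of epochs, toy data, and function names are all hypothetical choices, and at the hinge $t = 1$ we simply pick the value $0$ from the set $[-1, 0]$.

```python
def hinge_subgradient(t):
    """One valid subgradient of max(0, 1 - t), following (12.54)."""
    return -1.0 if t < 1.0 else 0.0   # at t == 1 we pick 0 from [-1, 0]


def train_svm(X, y, C=1.0, step=0.01, epochs=1000):
    """Subgradient descent on 0.5*||w||^2 + C * sum_n max(0, 1 - y_n(w.x_n + b))."""
    dim = len(X[0])
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        grad_w = list(w)              # gradient of the regularizer 0.5*||w||^2
        grad_b = 0.0
        for x_n, y_n in zip(X, y):
            t = y_n * (sum(wi * xi for wi, xi in zip(w, x_n)) + b)
            g = hinge_subgradient(t)  # derivative of the loss with respect to t
            for k in range(dim):
                grad_w[k] += C * g * y_n * x_n[k]   # chain rule: dt/dw = y_n x_n
            grad_b += C * g * y_n                   # chain rule: dt/db = y_n
        w = [wi - step * gi for wi, gi in zip(w, grad_w)]
        b -= step * grad_b
    return w, b


# Toy linearly separable data (hypothetical values).
X = [(2.0, 2.0), (3.0, 1.0), (0.0, 0.0), (-1.0, 1.0)]
y = [+1, +1, -1, -1]
w, b = train_svm(X, y)
predictions = [
    1 if sum(wi * xi for wi, xi in zip(w, x)) + b >= 0 else -1 for x in X
]
```

With a constant step size the iterates only reach a neighborhood of the minimizer and keep oscillating around the kinks of the hinge loss. A decreasing step size schedule is a common remedy.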

Both the primal and the dual SVM result in a convex quadratic pro- 
gramming problem (constrained optimization). Note that the primal SVM 
in (12.26a) has optimization variables that have the size of the dimen- 
sion D of the input examples. The dual SVM in (12.41) has optimization 
variables that have the size of the number N of examples. 

To express the primal SVM in the standard form (7.45) for quadratic 
programming, let us assume that we use the dot product (3.5) as the 
inner product. We rearrange the equation for the primal SVM (12.26a), 
such that the optimization variables are all on the right and the inequality 
of the constraint matches the standard form. This yields the optimization 


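$$\begin{aligned}
\min_{\boldsymbol{w}, b, \boldsymbol{\xi}} \quad & \frac{1}{2} \|\boldsymbol{w}\|^2 + C \sum_{n=1}^{N} \xi_n \\
\text{subject to} \quad & -y_n \boldsymbol{x}_n^\top \boldsymbol{w} - y_n b - \xi_n \leq -1 \\
& -\xi_n \leq 0
\end{aligned} \tag{12.55}$$

for $n = 1, \ldots, N$. By concatenating the variables $\boldsymbol{w}$, $b$, $\boldsymbol{\xi}$ into a single vector,
and carefully collecting the terms, we obtain the following matrix form of
the soft margin SVM:

$$\begin{aligned}
\min_{\boldsymbol{w}, b, \boldsymbol{\xi}} \quad & \frac{1}{2}
\begin{bmatrix} \boldsymbol{w} \\ b \\ \boldsymbol{\xi} \end{bmatrix}^\top
\begin{bmatrix} \boldsymbol{I}_D & \boldsymbol{0}_{D,N+1} \\ \boldsymbol{0}_{N+1,D} & \boldsymbol{0}_{N+1,N+1} \end{bmatrix}
\begin{bmatrix} \boldsymbol{w} \\ b \\ \boldsymbol{\xi} \end{bmatrix}
+ \begin{bmatrix} \boldsymbol{0}_{D+1,1} \\ C \boldsymbol{1}_{N,1} \end{bmatrix}^\top
\begin{bmatrix} \boldsymbol{w} \\ b \\ \boldsymbol{\xi} \end{bmatrix} \\
\text{subject to} \quad &
\begin{bmatrix} -\boldsymbol{Y}\boldsymbol{X} & -\boldsymbol{y} & -\boldsymbol{I}_N \\ \boldsymbol{0}_{N,D+1} & & -\boldsymbol{I}_N \end{bmatrix}
\begin{bmatrix} \boldsymbol{w} \\ b \\ \boldsymbol{\xi} \end{bmatrix}
\leq \begin{bmatrix} -\boldsymbol{1}_{N,1} \\ \boldsymbol{0}_{N,1} \end{bmatrix} .
\end{aligned} \tag{12.56}$$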


In the preceding optimization problem, the minimization is over the
parameters $[\boldsymbol{w}^\top, b, \boldsymbol{\xi}^\top]^\top \in \mathbb{R}^{D+1+N}$, and we use the notation: $\boldsymbol{I}_m$ to
represent the identity matrix of size $m \times m$, $\boldsymbol{0}_{m,n}$ to represent the matrix
of zeros of size $m \times n$, and $\boldsymbol{1}_{m,n}$ to represent the matrix of ones of size
$m \times n$. In addition, $\boldsymbol{y}$ is the vector of labels $[y_1, \ldots, y_N]^\top$, $\boldsymbol{Y} = \operatorname{diag}(\boldsymbol{y})$
is an $N$ by $N$ matrix where the elements of the diagonal are from $\boldsymbol{y}$, and
$\boldsymbol{X} \in \mathbb{R}^{N \times D}$ is the matrix obtained by concatenating all the examples.
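The collection of terms in (12.56) can be sketched in code. The example below is an illustration under our own naming: `primal_qp`, `objective`, `feasible`, and the toy data are hypothetical, and the right-hand side of the constraints is called `rhs` to avoid a clash with the bias $b$. It assembles the matrices of the standard form $\min \frac{1}{2}\boldsymbol{z}^\top \boldsymbol{Q}\boldsymbol{z} + \boldsymbol{c}^\top \boldsymbol{z}$ subject to $\boldsymbol{A}\boldsymbol{z} \leq \text{rhs}$ with $\boldsymbol{z} = [\boldsymbol{w}^\top, b, \boldsymbol{\xi}^\top]^\top$, and evaluates them at one candidate point.

```python
def primal_qp(X, y, C):
    """Build Q, c, A, rhs of (12.56) for z = [w, b, xi]."""
    N, D = len(X), len(X[0])
    M = D + 1 + N                                   # number of variables
    # Q has the identity I_D in its top-left block and zeros elsewhere.
    Q = [[1.0 if i == j and i < D else 0.0 for j in range(M)] for i in range(M)]
    c = [0.0] * (D + 1) + [C] * N
    A, rhs = [], []
    for n in range(N):      # -y_n x_n^T w - y_n b - xi_n <= -1
        row = [-y[n] * X[n][k] for k in range(D)] + [-float(y[n])]
        row += [-1.0 if m == n else 0.0 for m in range(N)]
        A.append(row)
        rhs.append(-1.0)
    for n in range(N):      # -xi_n <= 0
        A.append([0.0] * (D + 1) + [-1.0 if m == n else 0.0 for m in range(N)])
        rhs.append(0.0)
    return Q, c, A, rhs


def objective(Q, c, z):
    """Evaluate 0.5 * z^T Q z + c^T z."""
    n = len(z)
    quad = sum(z[i] * Q[i][j] * z[j] for i in range(n) for j in range(n))
    return 0.5 * quad + sum(ci * zi for ci, zi in zip(c, z))


def feasible(A, rhs, z, tol=1e-12):
    """Check A z <= rhs row by row."""
    return all(
        sum(a * v for a, v in zip(row, z)) <= r + tol for row, r in zip(A, rhs)
    )


X = [(2.0, 2.0), (3.0, 1.0), (0.0, 0.0), (-1.0, 1.0)]
y = [+1, +1, -1, -1]
Q, c, A, rhs = primal_qp(X, y, C=1.0)

z = [0.5, 0.5, -1.0, 0.0, 0.0, 0.0, 0.0]   # w = (0.5, 0.5), b = -1, xi = 0
value = objective(Q, c, z)
ok = feasible(A, rhs, z)
```

For this candidate point every example sits exactly on the margin, so all margin constraints hold with equality and the objective reduces to $\frac{1}{2}\|\boldsymbol{w}\|^2$. The matrices could be handed to any quadratic programming solver that accepts this standard form.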

We can similarly perform a collection of terms for the dual version of the
SVM (12.41). To express the dual SVM in standard form, we first have to
express the kernel matrix $\boldsymbol{K}$ such that each entry is $K_{ij} = k(\boldsymbol{x}_i, \boldsymbol{x}_j)$. If we
have an explicit feature representation $\boldsymbol{x}_i$ then we define $K_{ij} = \langle \boldsymbol{x}_i, \boldsymbol{x}_j \rangle$.
For convenience of notation we introduce a matrix with zeros everywhere
except on the diagonal, where we store the labels, that is, $\boldsymbol{Y} = \operatorname{diag}(\boldsymbol{y})$.
The dual SVM can be written as


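$$\begin{aligned}
\min_{\boldsymbol{\alpha}} \quad & \frac{1}{2} \boldsymbol{\alpha}^\top \boldsymbol{Y} \boldsymbol{K} \boldsymbol{Y} \boldsymbol{\alpha} - \boldsymbol{1}_{N,1}^\top \boldsymbol{\alpha} \\
\text{subject to} \quad &
\begin{bmatrix} \boldsymbol{y}^\top \\ -\boldsymbol{y}^\top \\ -\boldsymbol{I}_N \\ \boldsymbol{I}_N \end{bmatrix} \boldsymbol{\alpha}
\leq \begin{bmatrix} \boldsymbol{0}_{N+2,1} \\ C \boldsymbol{1}_{N,1} \end{bmatrix} .
\end{aligned} \tag{12.57}$$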


Remark. In Sections 7.3.1 and 7.3.2, we introduced the standard forms
of the constraints to be inequality constraints. We will express the dual
SVM’s equality constraint as two inequality constraints, i.e.,

$$\boldsymbol{A}\boldsymbol{x} = \boldsymbol{b} \quad \text{is replaced by} \quad \boldsymbol{A}\boldsymbol{x} \leq \boldsymbol{b} \quad \text{and} \quad \boldsymbol{A}\boldsymbol{x} \geq \boldsymbol{b} . \tag{12.58}$$

Particular software implementations of convex optimization methods may
provide the ability to express equality constraints. ♢
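The dual problem (12.57), including the split of the equality constraint into two inequalities as in (12.58), can be assembled in the same way. This is again only a sketch with hypothetical names (`dual_qp`, `linear_kernel`, `objective`, `feasible`) and toy data. We use the dot product as the kernel, and `rhs` denotes the right-hand side of the constraints.

```python
def linear_kernel(x, z):
    """Dot product, used here as the kernel k(x, z)."""
    return sum(xi * zi for xi, zi in zip(x, z))


def dual_qp(X, y, C, kernel):
    """Build Q = Y K Y, c = -1, and the constraints A a <= rhs of (12.57)."""
    N = len(X)
    Q = [[y[i] * kernel(X[i], X[j]) * y[j] for j in range(N)] for i in range(N)]
    c = [-1.0] * N
    identity = [[1.0 if i == j else 0.0 for j in range(N)] for i in range(N)]
    A = [[float(v) for v in y]]                     #  y^T a <= 0
    A.append([-float(v) for v in y])                # -y^T a <= 0, so y^T a = 0
    A += [[-e for e in row] for row in identity]    # -a_n <= 0
    A += identity                                   #  a_n <= C
    rhs = [0.0] * (N + 2) + [C] * N
    return Q, c, A, rhs


def objective(Q, c, a):
    """Evaluate 0.5 * a^T Q a + c^T a."""
    n = len(a)
    quad = sum(a[i] * Q[i][j] * a[j] for i in range(n) for j in range(n))
    return 0.5 * quad + sum(ci * ai for ci, ai in zip(c, a))


def feasible(A, rhs, a, tol=1e-12):
    """Check A a <= rhs row by row."""
    return all(
        sum(r * v for r, v in zip(row, a)) <= bound + tol
        for row, bound in zip(A, rhs)
    )


X = [(2.0, 2.0), (3.0, 1.0), (0.0, 0.0), (-1.0, 1.0)]
y = [+1, +1, -1, -1]
Q, c, A, rhs = dual_qp(X, y, C=1.0, kernel=linear_kernel)

alpha = [0.25, 0.0, 0.25, 0.0]      # candidate dual variables
value = objective(Q, c, alpha)
ok = feasible(A, rhs, alpha)
```

The first two rows of `A` together enforce $\boldsymbol{y}^\top \boldsymbol{\alpha} = 0$, and the remaining $2N$ rows enforce the box constraint $0 \leq \alpha_n \leq C$. Note that the dual has $N$ variables regardless of the input dimension $D$, and that the data enter only through the kernel matrix.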


Since there are many different possible views of the SVM, there are 
many approaches for solving the resulting optimization problem. The ap- 
proach presented here, expressing the SVM problem in standard convex 
optimization form, is not often used in practice. The two main implemen- 
tations of SVM solvers are Chang and Lin (2011) (which is open source) 
and Joachims (1999). Since SVMs have a clear and well-defined optimiza- 
tion problem, many approaches based on numerical optimization tech- 
niques (Nocedal and Wright, 2006) can be applied (Shawe-Taylor and 
Sun, 2011). 


12.6 Further Reading 


The SVM is one of many approaches for studying binary classification. 
Other approaches include the perceptron, logistic regression, Fisher dis- 
criminant, nearest neighbor, naive Bayes, and random forest (Bishop, 2006; 
Murphy, 2012). A short tutorial on SVMs and kernels on discrete se- 
quences can be found in Ben-Hur et al. (2008). The development of SVMs 
is closely linked to empirical risk minimization, discussed in Section 8.2. 
Hence, the SVM has strong theoretical properties (Vapnik, 2000; Steinwart
and Christmann, 2008). The book about kernel methods (Schölkopf
and Smola, 2002) includes many details of support vector machines and
how to optimize them. A broader book about kernel methods (Shawe-Taylor
and Cristianini, 2004) also includes many linear algebra approaches
for different machine learning problems. 

An alternative derivation of the dual SVM can be obtained using the
idea of the Legendre-Fenchel transform (Section 7.3.3). The derivation
considers each term of the unconstrained formulation of the SVM (12.31) 
separately and calculates their convex conjugates (Rifkin and Lippert, 
2007). Readers interested in the functional analysis view (also the reg- 
ularization methods view) of SVMs are referred to the work by Wahba 
(1990). Theoretical exposition of kernels (Aronszajn, 1950; Schwartz, 
1964; Saitoh, 1988; Manton and Amblard, 2015) requires a basic ground- 
ing in linear operators (Akhiezer and Glazman, 1993). The idea of kernels 
has been generalized to Banach spaces (Zhang et al., 2009) and Krein
spaces (Ong et al., 2004; Loosli et al., 2016). 

Observe that the hinge loss has three equivalent representations, as 
shown in (12.28) and (12.29), as well as the constrained optimization 
problem in (12.33). The formulation (12.28) is often used when compar- 
ing the SVM loss function with other loss functions (Steinwart, 2007). 
The two-piece formulation (12.29) is convenient for computing subgra- 
dients, as each piece is linear. The third formulation (12.33), as seen 
in Section 12.5, enables the use of convex quadratic programming (Sec- 
tion 7.3.2) tools. 
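The equivalence of the first two representations is easy to check numerically. In the sketch below we assume that (12.28) is the form $\max\{0, 1 - t\}$ and that (12.29) is its two-piece form. The function names and the evaluation grid are our own.

```python
def hinge_max(t):
    """Hinge loss written as max(0, 1 - t)."""
    return max(0.0, 1.0 - t)


def hinge_piecewise(t):
    """Hinge loss written as two linear pieces."""
    return 0.0 if t >= 1.0 else 1.0 - t


grid = [k / 4.0 - 2.0 for k in range(17)]   # t from -2 to 2 in steps of 0.25
agree = all(hinge_max(t) == hinge_piecewise(t) for t in grid)
```

The two-piece form makes it clear why subgradients are simple to compute: each piece is linear, with slope $-1$ for $t < 1$ and slope $0$ for $t > 1$.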

Since binary classification is a well-studied task in machine learning, 
other words are also sometimes used, such as discrimination, separation, 
and decision. Furthermore, there are three quantities that can be the out- 
put of a binary classifier. First is the output of the linear function itself 
(often called the score), which can take any real value. This output can be 
used for ranking the examples, and binary classification can be thought 
of as picking a threshold on the ranked examples (Shawe-Taylor and Cris- 
tianini, 2004). The second quantity that is often considered the output 
of a binary classifier is the output determined after it is passed through 
a non-linear function to constrain its value to a bounded range, for ex- 
ample in the interval [0,1]. A common non-linear function is the sigmoid 
function (Bishop, 2006). When the non-linearity results in well-calibrated 
probabilities (Gneiting and Raftery, 2007; Reid and Williamson, 2011), 
this is called class probability estimation. The third output of a binary 
classifier is the final binary decision $\{+1, -1\}$, which is the one most
commonly assumed to be the output of the classifier.
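The three quantities can be computed side by side for a linear classifier. The following sketch uses hypothetical parameters and examples, and the function names are ours. Note that passing the score through the sigmoid only bounds the value in $[0, 1]$; without a calibration step it should not be read as a class probability.

```python
import math


def score(w, b, x):
    """Output of the linear function, any real value."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b


def sigmoid(s):
    """Squash a score into the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-s))


def decision(s):
    """Final binary decision obtained by thresholding the score at zero."""
    return 1 if s >= 0 else -1


w, b = [0.5, 0.5], -1.0                      # hypothetical parameters
examples = [(2.0, 2.0), (0.0, 0.0), (1.5, 1.0)]

scores = [score(w, b, x) for x in examples]
ranked = sorted(examples, key=lambda x: score(w, b, x), reverse=True)
squashed = [sigmoid(s) for s in scores]
labels = [decision(s) for s in scores]
```

Here `ranked` orders the examples by score, and `labels` corresponds to picking the threshold zero on that ranking. A different threshold would trade off the two types of classification error differently.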

The SVM is a binary classifier that does not naturally lend itself to a
probabilistic interpretation. There are several approaches for converting
the raw output of the linear function (the score) into a calibrated class
probability estimate $P(Y = 1 \mid X = \boldsymbol{x})$ that involve an additional
calibration step (Platt, 2000; Zadrozny and Elkan, 2001; Lin et al., 2007).
From the training perspective, there are many related probabilistic
approaches. We mentioned at the end of Section 12.2.5 that there is a
relationship between the loss function and the likelihood (also compare
Sections 8.2 and 8.3). The maximum likelihood approach corresponding to
a well-calibrated transformation during training is called logistic regres- 
sion, which comes from a class of methods called generalized linear mod- 
els. Details of logistic regression from this point of view can be found in 
Agresti (2002, chapter 5) and McCullagh and Nelder (1989, chapter 4). 
Naturally, one could take a more Bayesian view of the classifier output by 
estimating a posterior distribution using Bayesian logistic regression. The 
Bayesian view also includes the specification of the prior, which includes 
design choices such as conjugacy (Section 6.6.1) with the likelihood. Ad- 
ditionally, one could consider latent functions as priors, which results in 
Gaussian process classification (Rasmussen and Williams, 2006, chapter 
3). 
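The additional calibration step for converting scores into class probability estimates can be sketched as fitting a sigmoid to held-out scores. The code below is a simplified illustration in the spirit of such sigmoid-based calibration, not a reproduction of any of the cited algorithms: it assumes the model $P(Y = 1 \mid s) = 1 / (1 + \exp(A s + B))$ and fits $A$ and $B$ by plain gradient descent on the negative log-likelihood. The scores, labels, step size, and iteration count are hypothetical.

```python
import math


def fit_sigmoid(scores, labels, step=0.1, iters=2000):
    """Fit P(y=+1 | s) = 1 / (1 + exp(A*s + B)) by gradient descent."""
    A, B = 0.0, 0.0
    targets = [1.0 if y == 1 else 0.0 for y in labels]
    n = len(scores)
    for _ in range(iters):
        grad_A = grad_B = 0.0
        for s, t in zip(scores, targets):
            p = 1.0 / (1.0 + math.exp(A * s + B))
            # Gradient of the negative log-likelihood:
            # d/dA = (t - p) * s and d/dB = (t - p).
            grad_A += (t - p) * s
            grad_B += (t - p)
        A -= step * grad_A / n
        B -= step * grad_B / n
    return A, B


def calibrated(s, A, B):
    """Calibrated estimate of P(y=+1 | s)."""
    return 1.0 / (1.0 + math.exp(A * s + B))


# Hypothetical held-out scores with overlapping classes.
scores = [-2.0, -1.0, -0.5, 0.3, 0.4, 1.0, 1.5, 2.5]
labels = [-1, -1, 1, -1, 1, 1, 1, 1]
A, B = fit_sigmoid(scores, labels)
probabilities = [calibrated(s, A, B) for s in scores]
```

A negative fitted value of `A` means that larger scores map to larger probabilities of the positive class. In practice the calibration data should be separate from the data used to train the classifier, otherwise the estimates tend to be overconfident.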